Using pdf2Data REST API

important

This example uses configuration values for the RESTful service that can be set up during container deployment.
We assume that the pdf2Data REST API Engine is available at localhost:8080 and that the authorization token is "AUTH_TOKEN". Replace these values with the ones you have configured.

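As with the other engine types, using the REST API service involves four operations: uploading a license, registering a template, scheduling a recognition job, and getting the recognition result.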

Upload license

This must be done once per service per license.

curl -X 'POST' \
'https://localhost:8080/api/v2/license' \
-H 'accept: application/json' \
-H 'Authorization: Bearer AUTH_TOKEN' \
-H 'Content-Type: multipart/form-data' \
-F 'licenseFile=@pdf2data_license.json;type=application/json'

Register template

Register the template in the Engine (once per template per instance).

curl -X 'POST' \
'https://localhost:8080/api/v2/templates' \
-H 'accept: application/json' \
-H 'Authorization: Bearer AUTH_TOKEN' \
-H 'Content-Type: multipart/form-data' \
-F 'templateArchive=@template_for_sdk.p2d'

Response:

{
"id": "templateID",
"name": "template_for_sdk",
"description": "test"
}
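
To carry the id into later calls from a script, the response body can be parsed as JSON. The following is a minimal Python sketch, assuming the response has the shape shown above; the function name extract_template_id is hypothetical and not part of pdf2Data.

```python
import json


def extract_template_id(body: str) -> str:
    """Return the "id" field from a template registration response body."""
    data = json.loads(body)
    template_id = data.get("id")
    if not template_id:
        # Assumption: a successful registration always carries a non-empty id.
        raise ValueError("response does not contain a template id")
    return template_id


# Sample body copied from the response shown above.
sample_response = '{"id": "templateID", "name": "template_for_sdk", "description": "test"}'
template_id = extract_template_id(sample_response)
```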

In the next step, we will use this id ("templateID") to specify which template must be used for parsing.

important

You should use "processed" templates, i.e. ones that have a *.p2d extension.
You can get a "processed" template by clicking the "Download for SDK" button in the pdf2Data UI with the Manager component.

Recognize

The REST Engine is asynchronous by nature.
Whenever you want to process a PDF using the pdf2Data engine, you should:

Schedule a recognition job

PDF:

curl -X 'POST' \
'https://localhost:8080/api/v2/jobs' \
-H 'accept: application/json' \
-H 'Authorization: Bearer AUTH_TOKEN' \
-H 'Content-Type: multipart/form-data' \
-F 'pdf=@FileToParse.pdf;type=application/pdf' \
-F 'jobRequest={
"jobType": "RECOGNIZE",
"templateId": "templateID",
"preprocessingType": "NONE"
}'

Image or scanned PDF:

curl -X 'POST' \
'https://localhost:8080/api/v2/jobs' \
-H 'accept: application/json' \
-H 'Authorization: Bearer AUTH_TOKEN' \
-H 'Content-Type: multipart/form-data' \
-F 'image=@FileToParse.png;type=image/png' \
-F 'jobRequest={
"jobType": "RECOGNIZE",
"templateId": "templateID",
"preprocessingType": "OCR"
}'

Where:

  • jobRequest.jobType - the type of job to perform:
    • use RECOGNIZE for actual recognition; note that this consumes license volume;
    • you can also run CHECK beforehand to verify that the result counts are as expected; this call does not affect license volume.
  • jobRequest.templateId - the id of a template registered in the engine (see "Register template" above).
  • jobRequest.preprocessingType - the preprocessing type for the document. Can be NONE or OCR.
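
The jobRequest form field is a JSON string, so it can be assembled and checked before the call is made. The sketch below is one way to do that in Python; build_job_request and the two constant tuples are hypothetical names, and the allowed values are only those listed above.

```python
import json

# Values named in the field descriptions above.
JOB_TYPES = ("RECOGNIZE", "CHECK")
PREPROCESSING_TYPES = ("NONE", "OCR")


def build_job_request(template_id: str,
                      job_type: str = "RECOGNIZE",
                      preprocessing_type: str = "NONE") -> str:
    """Return the JSON string to send as the jobRequest form field."""
    if job_type not in JOB_TYPES:
        raise ValueError(f"unsupported jobType: {job_type!r}")
    if preprocessing_type not in PREPROCESSING_TYPES:
        raise ValueError(f"unsupported preprocessingType: {preprocessing_type!r}")
    return json.dumps({
        "jobType": job_type,
        "templateId": template_id,
        "preprocessingType": preprocessing_type,
    })


# A CHECK run on a scanned document; CHECK does not affect license volume.
check_request = build_job_request("templateID", job_type="CHECK",
                                  preprocessing_type="OCR")
```

Rejecting unknown values locally avoids sending a request that the engine would presumably refuse anyway.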

The response contains the job ID:

Response (PDF)

{
"jobId": "jobID",
"jobType": "RECOGNIZE",
"jobStatus": "QUEUED",
"templateId": "templateID",
"pdfName": "FileToParse.pdf",
"imageName": null,
"errors": [
"string"
],
"preprocessingType": "NONE"
}

Response (image or scanned PDF)

{
"jobId": "jobID",
"jobType": "RECOGNIZE",
"jobStatus": "QUEUED",
"templateId": "templateID",
"pdfName": null,
"imageName": "FileToParse.png",
"errors": [
"string"
],
"preprocessingType": "OCR"
}
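
Both responses share the same fields and differ only in which of pdfName and imageName is set. As a rough Python sketch, assuming the response shapes shown above, the job ID and status could be read like this; JobInfo and parse_job_response are hypothetical helper names.

```python
import json
from typing import NamedTuple, Optional


class JobInfo(NamedTuple):
    job_id: str
    status: str
    source_name: Optional[str]


def parse_job_response(body: str) -> JobInfo:
    """Extract the job ID, status and uploaded file name from a job response."""
    data = json.loads(body)
    # In the examples above, one of pdfName / imageName is null.
    source_name = data.get("pdfName") or data.get("imageName")
    return JobInfo(data["jobId"], data["jobStatus"], source_name)


# Sample body based on the image response shown above (errors omitted).
sample_job_response = """{
  "jobId": "jobID",
  "jobType": "RECOGNIZE",
  "jobStatus": "QUEUED",
  "templateId": "templateID",
  "pdfName": null,
  "imageName": "FileToParse.png",
  "preprocessingType": "OCR"
}"""
job = parse_job_response(sample_job_response)
```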

Get recognition result

curl -X 'GET' \
'https://localhost:8080/api/v2/jobs/jobID/result?outputFormat=JSON' \
-H 'accept: */*' \
-H 'Authorization: Bearer AUTH_TOKEN'

Where:

  • outputFormat - the file format in which extracted data must be presented: JSON, XML, JSON_WITH_META, XML_WITH_META.

This call returns the extracted values in the specified format.
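
The same GET request can be prepared from Python with only the standard library. This is a sketch under the assumptions used throughout this page (engine at localhost:8080, token "AUTH_TOKEN"); build_result_request is a hypothetical helper, and it only constructs the request without sending it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Output formats listed above.
OUTPUT_FORMATS = ("JSON", "XML", "JSON_WITH_META", "XML_WITH_META")


def build_result_request(base_url: str, token: str, job_id: str,
                         output_format: str = "JSON") -> Request:
    """Build (but do not send) the GET request for a job's recognition result."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported outputFormat: {output_format!r}")
    # Percent-encode the job ID so it is safe to place in the URL path.
    path = f"/api/v2/jobs/{quote(job_id, safe='')}/result"
    query = urlencode({"outputFormat": output_format})
    headers = {"accept": "*/*", "Authorization": f"Bearer {token}"}
    return Request(f"{base_url}{path}?{query}", headers=headers, method="GET")


result_request = build_result_request("https://localhost:8080", "AUTH_TOKEN",
                                      "jobID", "XML")
# To send it: urllib.request.urlopen(result_request).read()
```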