Native Java Library

Prerequisites

The pdf2Data Java SDK requires Java 8, Java 11, or Java 17 to be installed on your system.

We guarantee software compatibility with the Oracle JRE 8 and Open JRE 11/17.

We recommend at least 1.5 GB of Java heap space, plus 500 MB for each additional thread.
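The heap recommendation above can be turned into a small sizing helper. The sketch below is only an illustration: the class name HeapSizing and its methods are hypothetical (they are not part of the pdf2Data SDK), and it assumes that 1.5 GB means 1536 MB.

```java
// Sketch: heap sizing based on the recommendation of 1.5 GB plus 500 MB per additional thread.
// HeapSizing is a hypothetical helper, not part of the pdf2Data SDK.
final class HeapSizing {
    static final long BASE_HEAP_MB = 1536;        // assumption: 1.5 GB = 1536 MB, covers the first thread
    static final long PER_EXTRA_THREAD_MB = 500;  // each additional processing thread

    private HeapSizing() { }

    /** Returns the recommended heap size in MB for the given number of processing threads. */
    static long recommendedHeapMb(int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        return BASE_HEAP_MB + PER_EXTRA_THREAD_MB * (threads - 1);
    }

    /** Returns true if the maximum heap of the running JVM meets the recommendation. */
    static boolean currentHeapIsSufficient(int threads) {
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        return maxHeapMb >= recommendedHeapMb(threads);
    }

    public static void main(String[] args) {
        int threads = 4;
        System.out.println("Suggested JVM option for " + threads + " threads: -Xmx"
                + recommendedHeapMb(threads) + "m");
        System.out.println("Current JVM heap sufficient: " + currentHeapIsSufficient(threads));
    }
}
```

For four threads this suggests -Xmx3036m; the value can be passed to the JVM at startup.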

System Requirements

  • Recommended minimal hardware configuration:
    • 2 core CPU
    • Memory: 2 GB
    • Temp storage: 2 GB free disk space

While the Java SDK works fine on a single core, we recommend multiple cores when you process documents in parallel using separate threads (one document per thread).
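The "one document per thread" pattern can be sketched with a fixed thread pool from the JDK. Everything below is illustrative: ParallelDocuments and processAll are hypothetical names, and the task passed in is a placeholder for the actual extraction call. The sketch makes no assumption about whether a single extractor instance may be shared between threads; consult the SDK JavaDocs before doing so.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch: process each document on its own worker thread (hypothetical helper, not SDK code).
final class ParallelDocuments {
    private ParallelDocuments() { }

    /** Runs the task once per document on a pool of the given size and keeps the input order. */
    static <R> Map<Path, R> processAll(List<Path> documents, Function<Path, R> task, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one job per document.
            Map<Path, Future<R>> pending = new LinkedHashMap<>();
            for (Path document : documents) {
                Callable<R> job = () -> task.apply(document);
                pending.put(document, pool.submit(job));
            }
            // Wait for every job and collect the results in submission order.
            Map<Path, R> results = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<R>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A document task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Path> documents = Arrays.asList(Paths.get("invoice-1.pdf"), Paths.get("invoice-2.pdf"));
        int threads = Math.max(1, Runtime.getRuntime().availableProcessors());
        // Placeholder task: a real application would run the extraction for the document here.
        Map<Path, String> results = processAll(documents, p -> "processed " + p.getFileName(), threads);
        results.forEach((document, value) -> System.out.println(document + " -> " + value));
    }
}
```

Sizing the pool to the number of available cores matches the hardware recommendation above; remember that each additional thread also increases the recommended heap size.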

Installation

The preferred way to set up iText pdf2Data in Java is to use a build system such as Maven or Gradle and download the pdf2Data artifacts from the iText Artifactory.

The groupId is com.itextpdf.pdf2data, and the artifactId is pdf2data.

In Maven, the configuration would look similar to the example below:

Maven

Add the pdf2Data repository to the <repositories> section.

<repositories>
    <repository>
        <id>pdf2Data</id>
        <name>pdf2Data Maven Repository</name>
        <url>https://repo.itextsupport.com/pdf2data</url>
    </repository>
    <repository> <!-- can be skipped if license is unlimited or local reporting is going to be configured -->
        <id>itext-releases</id>
        <name>iText Repository-releases</name>
        <url>https://repo.itextsupport.com/releases</url>
    </repository>
</repositories>

Then add the dependencies to the <dependencies> section.

<dependencies>
    <dependency>
        <groupId>com.itextpdf.pdf2data</groupId>
        <artifactId>pdf2data</artifactId>
        <version>5.0.1</version>
    </dependency>
    <dependency> <!-- can be skipped if license is unlimited or local reporting is going to be configured -->
        <groupId>com.itextpdf.licensing</groupId>
        <artifactId>licensing-remote</artifactId>
        <version>4.1.3</version>
    </dependency>
</dependencies>

Using pdf2Data from your code

Starting with pdf2Data 4.0, the format of extraction templates has changed compared to pdf2Data 3.*. See the Migration guide for details.

With the pdf2Data UI (pdf2Data 4.0+), you can download templates optimized for use in the pdf2Data SDK.

1. Load the pdf2Data license

Make sure to load the license file before invoking any other pdf2Data code.

LicenseKey.loadLicenseFile(pathToLicenseFile);

2. Create an extractor

A pdf2Data extractor is created from an extraction template downloaded from the pdf2Data UI.

Initializing a Pdf2DataExtractor instance from a processed template takes a single function call:

Pdf2DataExtractor extractor = Pdf2DataExtractor.create(new File(P2D_TEMPLATE_PATH));

Tip: The extractor can be reused to process a batch of PDF files in a loop.
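To process a batch in a loop, you first need the list of PDF files. The sketch below uses only the JDK to collect them from a directory; PdfBatch and listPdfFiles are hypothetical names, and the comment in main marks where the extract call described in the next step would go, reusing one extractor for every file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: collect the PDF files of one directory for batch processing (hypothetical helper).
final class PdfBatch {
    private PdfBatch() { }

    /** Returns the regular files in the directory whose name ends with ".pdf" (case-insensitive), sorted. */
    static List<Path> listPdfFiles(Path directory) {
        try (Stream<Path> entries = Files.list(directory)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path directory = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path pdf : listPdfFiles(directory)) {
            // A real application would call extractor.extract(pdf.toFile()) here,
            // with a single extractor created once before the loop.
            System.out.println("Would extract: " + pdf);
        }
    }
}
```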

3. Extract data from PDF

RecognitionResultHolder result = extractor.extract(new File(PDF_PATH));

You can use the extracted values directly from the result, or save them in one of two structured formats (XML or JSON).

4. Get results for a specific data field

You can get all results as a sorted map by calling:

SortedMap<String, DataFieldResult> allResults = result.getResult().getDataFieldResults();

To get the results for a specific data field, use this call:

List<AbstractValueResult> dataFieldResult = allResults.get(DATAFIELD_NAME).getResults();

Result objects have a structure similar to the one described in the Recognition result specification. You can also consult the SDK JavaDocs.

5. Save extracted data

Tip: By default, your data is saved without metadata. To include metadata in the result, use the method overloads that accept the following SerializationProperties:

SerializationProperties properties = new SerializationProperties().setIncludeMetaData(true);
XML
// If you want to write results directly into file.
result.writeToXml(new File(RESULT_XML_PATH));

// writing result directly to HTTP response
result.writeToXml(response.getOutputStream()); // any other OutputStream implementation can be passed here

To save the result with metadata:

// save to file
result.writeToXml(new File(RESULT_XML_PATH), properties);

// writing result directly to HTTP response
result.writeToXml(response.getOutputStream(), properties); // any other OutputStream implementation can be passed here
JSON
// If you want to write results directly into file.
result.writeToJson(new File(RESULT_JSON_PATH));

// writing result directly to HTTP response
result.writeToJson(response.getOutputStream()); // any other OutputStream implementation can be passed here

To save the result with metadata:

// save to file
result.writeToJson(new File(RESULT_JSON_PATH), properties);

// writing result directly to HTTP response
result.writeToJson(response.getOutputStream(), properties); // any other OutputStream implementation can be passed here
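Because the write methods accept any OutputStream, the serialized result can also be captured in memory, for example to store it in a database column or to inspect it in a test. The sketch below shows the idea with a ByteArrayOutputStream. The StreamWriter interface and the InMemoryResult class are hypothetical stand-ins so that the example compiles without the SDK, and decoding the bytes as UTF-8 is an assumption; in an application the lambda would be something like out -> result.writeToJson(out).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch: capture stream-based output in memory (hypothetical helper, not SDK code).
final class InMemoryResult {
    private InMemoryResult() { }

    /** Hypothetical stand-in for a call such as result.writeToJson(out). */
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Runs the writer against an in-memory buffer and returns the bytes decoded as UTF-8 (assumed encoding). */
    static String captureAsString(StreamWriter writer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writer.writeTo(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder writer that emits a fixed JSON document instead of a real extraction result.
        String json = captureAsString(
                out -> out.write("{\"field\":\"value\"}".getBytes(StandardCharsets.UTF_8)));
        System.out.println(json);
    }
}
```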

Full code sample

LicenseKey.loadLicenseFile(pathToLicenseFile);

Pdf2DataExtractor extractor = Pdf2DataExtractor.create(new File(P2D_TEMPLATE_PATH));
RecognitionResultHolder result = extractor.extract(new File(PDF_PATH));

// If you want to write results directly into file.
result.writeToXml(new File(RESULT_XML_PATH));
result.writeToJson(new File(RESULT_JSON_PATH));

// writing result directly to HTTP response
result.writeToXml(response.getOutputStream()); // any other OutputStream implementation can be passed here
result.writeToJson(response.getOutputStream());

// To access the result objects directly, e.g. to save them to a DB or other structured storage:
// all results:
SortedMap<String, DataFieldResult> allResults = result.getResult().getDataFieldResults();
List<AbstractValueResult> dataFieldResult = allResults.get(DATAFIELD_NAME).getResults();

Using Optical Character Recognition (OCR)

To use OCR, your application needs some additional configuration.

1. Add one more dependency

<dependencies>
    <!-- pdf2data dependencies-->
    ...
    <dependency>
        <groupId>com.itextpdf.pdf2data</groupId>
        <artifactId>pdf2data-default-ocr-engine</artifactId>
        <version>5.0.1</version>
    </dependency>
</dependencies>

2. Create OCR engine instance

OcrWithPostProcessingEngine engine = Tesseract4BasedEngine
.builder(Collections.<String>singletonList("eng"), new File(PATH_TO_TRAINED_DATA))
// Optional table model initialization for OCR based tables recognition.
.enableTATRPostProcessing()
.build();

Where PATH_TO_TRAINED_DATA is the path to the trained OCR models that the Tesseract engine uses for recognition. In our tests we use https://github.com/itext/i7j-pdfocr/raw/3.0.1/pdfocr-tesseract4/src/test/resources/com/itextpdf/pdfocr/tessdata/eng.traineddata

Optional: OCR table detection, which improves table recognition, can be turned on by calling Pdf2DataTATRPostProcessorStaticInitializer#initializeStaticModels(new File(PATH_TO_TABLE_DETECTION_MODEL), new File(PATH_TO_TABLE_STRUCTURE_MODEL)) and providing the model files for table detection and table structure:

  • PATH_TO_TABLE_DETECTION_MODEL - path to the table detection model (https://pdf2data-public-resources.s3.eu-central-1.amazonaws.com/tatr/2023-09-07/table-detection.onnx)
  • PATH_TO_TABLE_STRUCTURE_MODEL - path to the table structure model (https://pdf2data-public-resources.s3.eu-central-1.amazonaws.com/tatr/2023-09-07/table-structure.onnx)

3. Create an extractor

// The engine can be null if you only want to perform standard PDF data extraction.
Pdf2DataExtractor extractor = Pdf2DataExtractor.create(new File(P2D_TEMPLATE_PATH), engine);

4. Create recognition properties

// Choose ONE of the following, depending on the type of input.
// To extract data from a standard PDF:
RecognitionProperties recognitionProperties = RecognitionProperties.createForPdfFile();
// To extract data from an image:
// RecognitionProperties recognitionProperties = RecognitionProperties.createForImageFile();
// To extract data from a scanned document:
// RecognitionProperties recognitionProperties = RecognitionProperties.createForPdfFileWithOcr();

5. Extract data from sample

RecognitionResultHolder result = extractor.extract(new File(PATH_TO_SAMPLE_FILE), recognitionProperties);
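When an application receives mixed input, the choice between the three recognition modes can be made per file. The sketch below is one possible way to decide, based on the file extension and a caller-supplied flag for scanned PDFs. InputKindSelector and InputKind are hypothetical names, and the list of image extensions is an assumption for illustration, not a statement of which formats the SDK supports. Each enum constant corresponds to one of the factory methods from step 4: PDF to createForPdfFile(), PDF_WITH_OCR to createForPdfFileWithOcr(), and IMAGE to createForImageFile().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch: decide which recognition mode to use for a file (hypothetical helper, not SDK code).
final class InputKindSelector {
    private InputKindSelector() { }

    /** One constant per RecognitionProperties factory method named in step 4. */
    enum InputKind { PDF, PDF_WITH_OCR, IMAGE }

    // Assumption: illustrative list only; check the SDK documentation for supported image formats.
    private static final Set<String> IMAGE_EXTENSIONS =
            new HashSet<>(Arrays.asList("png", "jpg", "jpeg", "tif", "tiff", "bmp"));

    /**
     * Picks the input kind from the file extension. Whether a PDF is a scan cannot be
     * derived from its name, so the caller has to supply that information.
     */
    static InputKind select(String fileName, boolean scanned) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        int dot = lower.lastIndexOf('.');
        String extension = dot >= 0 ? lower.substring(dot + 1) : "";
        if (extension.equals("pdf")) {
            return scanned ? InputKind.PDF_WITH_OCR : InputKind.PDF;
        }
        if (IMAGE_EXTENSIONS.contains(extension)) {
            return InputKind.IMAGE;
        }
        throw new IllegalArgumentException("Unsupported file type: " + fileName);
    }

    public static void main(String[] args) {
        System.out.println(select("invoice.pdf", false));
        System.out.println(select("scan.pdf", true));
        System.out.println(select("receipt.png", false));
    }
}
```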

Full code sample with OCR

LicenseKey.loadLicenseFile(pathToLicenseFile);
// Optional: load the table detection and table structure models before building the engine.
Pdf2DataTATRPostProcessorStaticInitializer.initializeStaticModels(
        new File(PATH_TO_TABLE_DETECTION_MODEL), new File(PATH_TO_TABLE_STRUCTURE_MODEL));

OcrWithPostProcessingEngine engine = Tesseract4BasedEngine
.builder(Collections.<String>singletonList("eng"), new File(PATH_TO_TRAINED_DATA))
// Optional table model initialization for OCR based tables recognition.
.enableTATRPostProcessing()
.build();

Pdf2DataExtractor extractor = Pdf2DataExtractor.create(new File(P2D_TEMPLATE_PATH), engine);

// Choose ONE of the following, depending on the type of input.
// To extract data from an image:
RecognitionProperties recognitionProperties = RecognitionProperties.createForImageFile();
// To extract data from a scanned document:
// RecognitionProperties recognitionProperties = RecognitionProperties.createForPdfFileWithOcr();
// To extract data from a standard PDF:
// RecognitionProperties recognitionProperties = RecognitionProperties.createForPdfFile();

RecognitionResultHolder result = extractor.extract(new File(PATH_TO_SAMPLE_FILE), recognitionProperties);

// If you want to write results directly into file.
result.writeToXml(new File(RESULT_XML_PATH));
result.writeToJson(new File(RESULT_JSON_PATH));

// writing result directly to HTTP response
result.writeToXml(response.getOutputStream()); // any other OutputStream implementation can be passed here
result.writeToJson(response.getOutputStream());

// To access the result objects directly, e.g. to save them to a DB or other structured storage:
// all results:
SortedMap<String, DataFieldResult> allResults = result.getResult().getDataFieldResults();
List<AbstractValueResult> dataFieldResult = allResults.get(DATAFIELD_NAME).getResults();