Native .NET Library

Prerequisites

The pdf2Data .NET SDK requires .NET Framework v4.6.1 to be installed on your system. The SDK does not support the .NET Core runtime.

System Requirements

  • Recommended minimum hardware configuration:
    • 2 core CPU
    • Memory: 2 GB
    • Temp storage: 2 GB free disk space

Installation

For .NET, iText pdf2Data is distributed as a NuGet package, available on NuGet.org or in the iText Artifactory.

You can browse for the package manually or install it with the NuGet Package Manager command Install-Package itext7.pdf2data. If you plan to use remote license volume reporting (recommended), also run Install-Package itext.licensing.remote -Version <compatible-licensing-version> (see the Compatibility Matrix for the exact version).
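Entered in the Visual Studio Package Manager Console, the two commands named above look as follows. The version placeholder is left exactly as in the text; substitute the value listed in the Compatibility Matrix.

```powershell
# Install the pdf2Data SDK
Install-Package itext7.pdf2data

# Recommended: add remote license volume reporting
Install-Package itext.licensing.remote -Version <compatible-licensing-version>
```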

Using pdf2Data from your code

As of pdf2Data 4.0, the format of extraction templates has changed compared to pdf2Data 3.*. See the Migration guide for details.

With the pdf2Data UI (pdf2Data 4.0+), you can download templates optimized for use in the pdf2Data SDK.

1. Load the pdf2Data license

Make sure to load the license file before invoking any other pdf2Data code.

LicenseKey.LoadLicenseFile(pathToLicenseFile);

2. Create an extractor

A pdf2Data extractor is created from an extraction template downloaded from the pdf2Data UI.

Initializing a Pdf2DataExtractor instance from a processed template now takes a single function call:

Pdf2DataExtractor extractor = Pdf2DataExtractor.Create(new FileInfo(P2D_TEMPLATE_PATH));
tip

The extractor can be reused to process a batch of PDF files in a loop.
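A minimal sketch of such a batch loop follows. To stay self-contained it does not reference the SDK: the per-file work is passed in as a delegate, and the class and method names (BatchRunner, ProcessAll) are hypothetical. The point is that the extractor is created once outside the loop, and that one unreadable document does not abort the whole batch.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BatchRunner
{
    // Applies processOne to every file and collects failures per file
    // instead of stopping at the first document that throws.
    public static IDictionary<string, Exception> ProcessAll(
        IEnumerable<FileInfo> files, Action<FileInfo> processOne)
    {
        var failures = new Dictionary<string, Exception>();
        foreach (FileInfo file in files)
        {
            try
            {
                processOne(file);
            }
            catch (Exception e)
            {
                failures[file.FullName] = e;
            }
        }
        return failures;
    }
}

// Hypothetical usage with an extractor created once, as in step 2
// (INPUT_DIR is a placeholder for your input folder):
//   var failures = BatchRunner.ProcessAll(
//       new DirectoryInfo(INPUT_DIR).GetFiles("*.pdf"),
//       pdf => extractor.Extract(pdf).WriteToJson(new FileInfo(pdf.FullName + ".json")));
```

The returned dictionary maps each failed file's full path to the exception it raised, so the failures can be logged or retried after the batch completes.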

3. Extract data from PDF

RecognitionResultHolder result = extractor.Extract(new FileInfo(PDF_PATH));

You can use the extracted values directly from the result or save them in one of two structured formats, XML or JSON.

4. Get results for a specific data field

You can get all results as a sorted dictionary by calling:

SortedDictionary<String, DataFieldResult> allResults = result.GetResult().GetDataFieldResults();

To get the results for a specific data field, use this call:

// The indexer throws KeyNotFoundException if the field is absent; use TryGetValue to probe first.
IList<AbstractValueResult> dataFieldResult = allResults[DATAFIELD_NAME].GetResults();

The result objects have a structure similar to the one described in the Recognition result specification.
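Before reading individual fields, it can help to check that the fields you rely on are present in the dictionary at all. The sketch below is generic over the value type, so it compiles without the SDK; FieldLookup and MissingFields are hypothetical names, and the field names in the usage comment are made up. Whether a key exists for a field in which nothing was recognized is not stated here, so treat this as a first check only.

```csharp
using System;
using System.Collections.Generic;

public static class FieldLookup
{
    // Returns the names from `required` that are not keys of `results`,
    // in the order in which they were requested.
    public static IList<string> MissingFields<TField>(
        SortedDictionary<string, TField> results, IEnumerable<string> required)
    {
        var missing = new List<string>();
        foreach (string name in required)
        {
            if (!results.ContainsKey(name))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}

// Hypothetical usage with the dictionary obtained above:
//   IList<string> missing = FieldLookup.MissingFields(allResults, new[] { "invoiceNumber", "total" });
```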

5. Save extracted data

tip

By default, your data is saved without metadata. To include metadata in the result, use the method overloads that accept the following SerializationProperties:

SerializationProperties properties = new SerializationProperties().SetIncludeMetaData(true);

XML
// Write the result directly to a file
result.WriteToXml(new FileInfo(RESULT_XML_PATH));

// Write the result directly to an HTTP response
result.WriteToXml(Response.Body); // any other Stream implementation can be passed here

To save the result with metadata:

// Save to a file
result.WriteToXml(new FileInfo(RESULT_XML_PATH), properties);

// Write the result directly to an HTTP response
result.WriteToXml(Response.Body, properties); // any other Stream implementation can be passed here

JSON
// Write the result directly to a file
result.WriteToJson(new FileInfo(RESULT_JSON_PATH));

// Write the result directly to an HTTP response
result.WriteToJson(Response.Body); // any other Stream implementation can be passed here

To save the result with metadata:

// Save to a file
result.WriteToJson(new FileInfo(RESULT_JSON_PATH), properties);

// Write the result directly to an HTTP response
result.WriteToJson(Response.Body, properties); // any other Stream implementation can be passed here
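Because the Stream overloads accept any Stream implementation, the serialized output can also be captured in memory, for example to log it or store it in a database column. The helper below is a sketch that is independent of the SDK: StreamCapture and CaptureUtf8 are hypothetical names, and decoding the bytes as UTF-8 is an assumption about the serializer's output encoding.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamCapture
{
    // Runs `write` against an in-memory stream and returns everything it wrote,
    // decoded as UTF-8 (assumed encoding).
    public static string CaptureUtf8(Action<Stream> write)
    {
        using (var buffer = new MemoryStream())
        {
            write(buffer);
            // MemoryStream.ToArray() still works if `write` closed the stream.
            return Encoding.UTF8.GetString(buffer.ToArray());
        }
    }
}

// Hypothetical usage with the result object from step 3:
//   string json = StreamCapture.CaptureUtf8(stream => result.WriteToJson(stream));
```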

Full code sample

LicenseKey.LoadLicenseFile(pathToLicenseFile);

Pdf2DataExtractor extractor = Pdf2DataExtractor.Create(new FileInfo(P2D_TEMPLATE_PATH));
RecognitionResultHolder result = extractor.Extract(new FileInfo(PDF_PATH));

// If you want to write results directly into file.
result.WriteToXml(new FileInfo(RESULT_XML_PATH));
result.WriteToJson(new FileInfo(RESULT_JSON_PATH));

// If you want to write results directly to an HTTP response (or another Stream), choose one format per stream
result.WriteToXml(Response.Body);
// result.WriteToJson(Response.Body);

// If you want to access the result objects directly, e.g. to save them to a database or other structured storage:
// all results:
SortedDictionary<String, DataFieldResult> allResults = result.GetResult().GetDataFieldResults();
// results for a single data field (the indexer throws KeyNotFoundException if the field is absent):
IList<AbstractValueResult> dataFieldResult = allResults[DATAFIELD_NAME].GetResults();

Using Optical Character Recognition (OCR)

To use OCR, you need to add some configuration to your application.

1. Create OCR engine instance

OcrWithPostProcessingEngine engine = Tesseract4BasedEngine.CreateBuilder(JavaCollectionsUtil.SingletonList<String>("eng"), new FileInfo(PATH_TO_TRAINED_DATA)).Build();

2. Create an extractor

// Engine can be null if you want to perform standard pdf data extraction.
Pdf2DataExtractor extractor = Pdf2DataExtractor.Create(new FileInfo(P2D_TEMPLATE_PATH), engine);

3. Create recognition properties

// Choose exactly one of the following, depending on the input.
// To extract data from a standard PDF:
// RecognitionProperties recognitionProperties = RecognitionProperties.CreateForPdfFile();
// To extract data from an image:
// RecognitionProperties recognitionProperties = RecognitionProperties.CreateForImageFile();
// To extract data from a scanned document:
RecognitionProperties recognitionProperties = RecognitionProperties.CreateForPdfFileWithOcr();
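When one application handles mixed input, the choice between the three factory methods has to be made per file. The sketch below shows one way to make that decision explicit without referencing the SDK. InputKind and InputClassifier are hypothetical names, and the list of image extensions is an assumption, not a statement of what the OCR engine supports. A scanned PDF cannot be told apart from a digital one by its extension, so the caller supplies that information.

```csharp
using System;
using System.IO;

public enum InputKind { Pdf, ScannedPdf, Image }

public static class InputClassifier
{
    // Maps a file name to the kind of input it represents.
    // pdfNeedsOcr tells the classifier whether .pdf files are scans.
    public static InputKind Classify(string fileName, bool pdfNeedsOcr)
    {
        string ext = Path.GetExtension(fileName).ToLowerInvariant();
        switch (ext)
        {
            case ".pdf":
                return pdfNeedsOcr ? InputKind.ScannedPdf : InputKind.Pdf;
            case ".png":
            case ".jpg":
            case ".jpeg":
            case ".tif":
            case ".tiff":
            case ".bmp":
                return InputKind.Image; // assumed set of image extensions
            default:
                throw new ArgumentException("Unsupported input type: " + fileName);
        }
    }
}

// Mapping to the factory methods shown above:
//   InputKind.Pdf        -> RecognitionProperties.CreateForPdfFile()
//   InputKind.ScannedPdf -> RecognitionProperties.CreateForPdfFileWithOcr()
//   InputKind.Image      -> RecognitionProperties.CreateForImageFile()
```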

4. Extract data from sample

RecognitionResultHolder result = extractor.Extract(new FileInfo(PATH_TO_SAMPLE_FILE), recognitionProperties);

Full code sample with OCR

LicenseKey.LoadLicenseFile(pathToLicenseFile);

OcrWithPostProcessingEngine engine = Tesseract4BasedEngine.CreateBuilder(JavaCollectionsUtil.SingletonList<String>("eng"), new FileInfo(PATH_TO_TRAINED_DATA)).Build();

// Engine can be null if you want to perform standard pdf data extraction.
Pdf2DataExtractor extractor = Pdf2DataExtractor.Create(new FileInfo(P2D_TEMPLATE_PATH), engine);

// Choose exactly one of the following, depending on the input.
// To extract data from a standard PDF:
// RecognitionProperties recognitionProperties = RecognitionProperties.CreateForPdfFile();
// To extract data from an image:
// RecognitionProperties recognitionProperties = RecognitionProperties.CreateForImageFile();
// To extract data from a scanned document:
RecognitionProperties recognitionProperties = RecognitionProperties.CreateForPdfFileWithOcr();

RecognitionResultHolder result = extractor.Extract(new FileInfo(PATH_TO_SAMPLE_FILE), recognitionProperties);

// If you want to write results directly into file.
result.WriteToXml(new FileInfo(RESULT_XML_PATH));
result.WriteToJson(new FileInfo(RESULT_JSON_PATH));

// If you want to write results directly to an HTTP response (or another Stream), choose one format per stream
result.WriteToXml(Response.Body);
// result.WriteToJson(Response.Body);

// If you want to access the result objects directly, e.g. to save them to a database or other structured storage:
// all results:
SortedDictionary<String, DataFieldResult> allResults = result.GetResult().GetDataFieldResults();
// results for a single data field (the indexer throws KeyNotFoundException if the field is absent):
IList<AbstractValueResult> dataFieldResult = allResults[DATAFIELD_NAME].GetResults();