Native .NET Library
Prerequisites
The pdf2Data .NET SDK requires .NET Framework v4.6.1 to be installed on your system. The pdf2Data SDK does not support the .NET Core runtime.
System Requirements
Recommended minimum hardware configuration:
- CPU: 2 cores
- Memory: 2 GB
- Temp storage: 2 GB free disk space
Installation
For .NET, iText pdf2Data is distributed as a NuGet package, which is available at NuGet.org or at the iText Artifactory.
You can browse for the desired NuGet package manually or install it with the following NuGet Package Manager command:
Install-Package itext7.pdf2data
In addition, if you are going to use remote license volume reporting (recommended), you also need to run:
Install-Package itext.licensing.remote -Version <compatible-licensing-version>
See the Compatibility Matrix for the exact version.
Using pdf2Data from your code
As of pdf2Data 4.0, the format of extraction templates has changed compared to pdf2Data 3.*. Please see the Migration guide to learn more.
With the pdf2Data UI (pdf2Data 4.0+), you can download templates optimized for use in the pdf2Data SDK.
1. Load the pdf2Data license
Make sure to load the license file before invoking any other pdf2Data code:
LicenseKey.LoadLicenseFile(pathToLicenseFile);
2. Create an extractor
A pdf2Data extractor is created from an extraction template downloaded from the pdf2Data UI.
Initializing the Pdf2DataExtractor instance from a processed template takes a single function call:
Pdf2DataExtractor extractor = Pdf2DataExtractor.Create(new FileInfo(P2D_TEMPLATE_PATH));
The extractor can be reused to process a batch of PDF files in a loop.
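As a minimal sketch of such a loop, the helper below walks a folder and hands every PDF file to a callback, so that one extractor instance can serve the whole batch. BatchRunner and ProcessAll are hypothetical names introduced for illustration and are not part of the pdf2Data SDK; the helper itself uses only System.IO.

```csharp
using System;
using System.IO;

public static class BatchRunner
{
    // Hypothetical helper (not part of the pdf2Data SDK): calls processOne for
    // every .pdf file located directly inside inputDir and returns the number
    // of files that were handed to the callback.
    public static int ProcessAll(string inputDir, Action<FileInfo> processOne)
    {
        int processed = 0;
        foreach (string path in Directory.EnumerateFiles(inputDir))
        {
            // Compare the extension explicitly so that only .pdf files are picked up.
            if (!string.Equals(Path.GetExtension(path), ".pdf", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }
            processOne(new FileInfo(path));
            processed++;
        }
        return processed;
    }
}
```

With an extractor created as shown above, a call could look like this, where INPUT_DIR is a placeholder for your own folder:
BatchRunner.ProcessAll(INPUT_DIR, pdf => extractor.Extract(pdf).WriteToJson(new FileInfo(pdf.FullName + ".json")));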
3. Extract data from PDF
RecognitionResultHolder result = extractor.Extract(new FileInfo(PDF_PATH));
You can use the extracted values directly from the result or save them in one of two structured formats, XML or JSON.
4. Get results for a specific data field
You can get all results as a sorted dictionary by calling:
SortedDictionary<String, DataFieldResult> allResults = result.GetResult().GetDataFieldResults();
To get the results for a specific data field, use this call:
IList<AbstractValueResult> dataFieldResult = allResults[DATAFIELD_NAME].GetResults();
Result objects have a structure similar to the one described in the Recognition result specification.
5. Save extracted data
By default, your data is saved without metadata. To include metadata in the result, use the method overloads that accept the following SerializationProperties:
SerializationProperties properties = new SerializationProperties().SetIncludeMetaData(true);
XML
// Write the result directly to a file
result.WriteToXml(new FileInfo(RESULT_XML_PATH));
// Write the result directly to an HTTP response
result.WriteToXml(Response.Body); // any other Stream implementation can be passed here
To save the result with metadata:
// Save to a file
result.WriteToXml(new FileInfo(RESULT_XML_PATH), properties);
// Write the result directly to an HTTP response
result.WriteToXml(Response.Body, properties); // any other Stream implementation can be passed here
JSON
// Write the result directly to a file
result.WriteToJson(new FileInfo(RESULT_JSON_PATH));
// Write the result directly to an HTTP response
result.WriteToJson(Response.Body); // any other Stream implementation can be passed here
To save the result with metadata:
// Save to a file
result.WriteToJson(new FileInfo(RESULT_JSON_PATH), properties);
// Write the result directly to an HTTP response
result.WriteToJson(Response.Body, properties); // any other Stream implementation can be passed here
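Because the Stream overloads accept any Stream implementation, the output can also be captured in memory, for example to log it or store it in a database. The sketch below is one way to do that. StreamCapture and CaptureAsString are hypothetical names, not SDK API, and decoding the bytes as UTF-8 is an assumption about the output encoding.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamCapture
{
    // Hypothetical helper (not part of the pdf2Data SDK): runs a writer against
    // an in-memory stream and returns everything it wrote as a string.
    // Assumption: the written bytes are UTF-8 encoded text.
    public static string CaptureAsString(Action<Stream> write)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            write(buffer);
            // ToArray() still returns the written bytes if the writer closed the stream.
            return Encoding.UTF8.GetString(buffer.ToArray());
        }
    }
}
```

With a result obtained as shown above, the JSON output could then be captured like this:
string json = StreamCapture.CaptureAsString(stream => result.WriteToJson(stream));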
Full code sample
LicenseKey.LoadLicenseFile(pathToLicenseFile);
Pdf2DataExtractor extractor = Pdf2DataExtractor.Create(new FileInfo(P2D_TEMPLATE_PATH));
RecognitionResultHolder result = extractor.Extract(new FileInfo(PDF_PATH));
// Write the results directly to files.
result.WriteToXml(new FileInfo(RESULT_XML_PATH));
result.WriteToJson(new FileInfo(RESULT_JSON_PATH));
// Or write the results directly to an HTTP response (or any other Stream).
result.WriteToXml(Response.Body);
result.WriteToJson(Response.Body);
// Or access the result objects directly, for example to save them to a database or other structured storage.
// All results:
SortedDictionary<String, DataFieldResult> allResults = result.GetResult().GetDataFieldResults();
// Results for a specific data field:
IList<AbstractValueResult> dataFieldResult = allResults[DATAFIELD_NAME].GetResults();
Using Optical Character Recognition (OCR)
To use OCR, your application needs some additional configuration.
1. Create OCR engine instance
OcrWithPostProcessingEngine engine = Tesseract4BasedEngine.CreateBuilder(JavaCollectionsUtil.SingletonList<String>("eng"), new FileInfo(PATH_TO_TRAINED_DATA)).Build();
2. Create an extractor
// The engine can be null if you only want to perform standard PDF data extraction.
Pdf2DataExtractor extractor = Pdf2DataExtractor.Create(new FileInfo(P2D_TEMPLATE_PATH), engine);
3. Create recognition properties
Use the one factory method that matches your input:
// To extract data from a standard PDF
RecognitionProperties recognitionProperties = RecognitionProperties.CreateForPdfFile();
// To extract data from an image
RecognitionProperties recognitionProperties = RecognitionProperties.CreateForImageFile();
// To extract data from a scanned document
RecognitionProperties recognitionProperties = RecognitionProperties.CreateForPdfFileWithOcr();
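When inputs of different kinds arrive in one batch, the choice between the three factory methods can be made per file. The sketch below shows one possible way to classify a file first. InputKind and InputClassifier are hypothetical names, not SDK API, and the list of image extensions is an assumption that should be adjusted to the formats your OCR setup accepts. A scanned PDF cannot be told apart from a standard one by its file name, so the caller has to supply that information.

```csharp
using System;
using System.IO;

// Hypothetical classification of inputs (not part of the pdf2Data SDK).
public enum InputKind { StandardPdf, ScannedPdf, Image }

public static class InputClassifier
{
    // Assumed list of image extensions; adjust it to what your OCR setup accepts.
    private static readonly string[] ImageExtensions = { ".png", ".jpg", ".jpeg", ".tif", ".tiff" };

    // Classifies a file by its extension. pdfIsScanned tells the helper whether
    // PDF inputs should be treated as scanned documents that need OCR.
    public static InputKind Classify(string path, bool pdfIsScanned)
    {
        string extension = Path.GetExtension(path);
        if (string.Equals(extension, ".pdf", StringComparison.OrdinalIgnoreCase))
        {
            return pdfIsScanned ? InputKind.ScannedPdf : InputKind.StandardPdf;
        }
        foreach (string imageExtension in ImageExtensions)
        {
            if (string.Equals(extension, imageExtension, StringComparison.OrdinalIgnoreCase))
            {
                return InputKind.Image;
            }
        }
        throw new NotSupportedException("Unsupported input file type: " + extension);
    }
}
```

The returned value can then be mapped to a factory method: StandardPdf to RecognitionProperties.CreateForPdfFile(), Image to RecognitionProperties.CreateForImageFile(), and ScannedPdf to RecognitionProperties.CreateForPdfFileWithOcr().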
4. Extract data from sample
RecognitionResultHolder result = extractor.Extract(new FileInfo(PATH_TO_SAMPLE_FILE), recognitionProperties);
Full code sample with OCR
LicenseKey.LoadLicenseFile(pathToLicenseFile);
OcrWithPostProcessingEngine engine = Tesseract4BasedEngine.CreateBuilder(JavaCollectionsUtil.SingletonList<String>("eng"), new FileInfo(PATH_TO_TRAINED_DATA)).Build();
// The engine can be null if you only want to perform standard PDF data extraction.
Pdf2DataExtractor extractor = Pdf2DataExtractor.Create(new FileInfo(P2D_TEMPLATE_PATH), engine);
// Choose one of the following, depending on the input.
// Standard PDF:
RecognitionProperties recognitionProperties = RecognitionProperties.CreateForPdfFile();
// Image:
// RecognitionProperties recognitionProperties = RecognitionProperties.CreateForImageFile();
// Scanned document:
// RecognitionProperties recognitionProperties = RecognitionProperties.CreateForPdfFileWithOcr();
RecognitionResultHolder result = extractor.Extract(new FileInfo(PATH_TO_SAMPLE_FILE), recognitionProperties);
// Write the results directly to files.
result.WriteToXml(new FileInfo(RESULT_XML_PATH));
result.WriteToJson(new FileInfo(RESULT_JSON_PATH));
// Or write the results directly to an HTTP response (or any other Stream).
result.WriteToXml(Response.Body);
result.WriteToJson(Response.Body);
// Or access the result objects directly, for example to save them to a database or other structured storage.
// All results:
SortedDictionary<String, DataFieldResult> allResults = result.GetResult().GetDataFieldResults();
// Results for a specific data field:
IList<AbstractValueResult> dataFieldResult = allResults[DATAFIELD_NAME].GetResults();