Custom OCR Engine

You can replace pdf2Data's default Tesseract 4-based OCR with an alternative OCR engine. This page shows how to implement a custom OCR engine based on LEADTools and how to combine it with the table detection post-processor.

1. Implement IOcrEngine

pdf2Data's OCR functionality builds on the iText pdfOCR module, which prepares PDF files for further data extraction. Custom OCR engines are therefore added through the iText pdfOCR API.

First, create an implementation of the IOcrEngine interface. This is the main entry point for adding a custom engine. The implementation runs the OCR processing and maps the results to the iText pdfOCR result API: a map from page number to the list of TextInfo objects recognized on that page.

Java: IOcrEngine Implementation Based on LEADTools

import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.pdfocr.IOcrEngine;
import com.itextpdf.pdfocr.OcrProcessContext;
import com.itextpdf.pdfocr.TextInfo;

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import leadtools.LeadRectD;
import leadtools.document.DocumentFactory;
import leadtools.document.DocumentPage;
import leadtools.document.DocumentPageText;
import leadtools.document.DocumentWord;
import leadtools.document.LEADDocument;
import leadtools.document.LoadDocumentOptions;
import leadtools.ocr.OcrEngine;
import leadtools.ocr.OcrEngineManager;
import leadtools.ocr.OcrEngineType;
import leadtools.ocr.OcrSettingManager;

public class CustomEngine implements IOcrEngine {

    @Override
    public Map<Integer, List<TextInfo>> doImageOcr(File input) {
        throw new UnsupportedOperationException();
    }

    @Override
    public Map<Integer, List<TextInfo>> doImageOcr(File file, OcrProcessContext ocrProcessContext) {
        // Create the OCR engine
        OcrEngine ocrEngine = OcrEngineManager.createEngine(OcrEngineType.LEAD);
        ocrEngine.startup(null, null, null,
                new File(PDF_2_DATA_OCR_LEADTOOLS_RUNTIME_PATH).getAbsolutePath());

        // Set necessary OCR settings if needed using settingManager.setValue()
        OcrSettingManager settingManager = ocrEngine.getSettingManager();

        // Load the document
        LoadDocumentOptions options = new LoadDocumentOptions();
        LEADDocument document = DocumentFactory.loadFromFile(file.getAbsolutePath(), options);
        document.getText().setOcrEngine(ocrEngine);

        // Prepare the OCR result map
        Map<Integer, List<TextInfo>> ocrResult = new HashMap<>();

        // Process each page of the document
        int currentPage = 0;
        for (DocumentPage page : document.getPages()) {
            DocumentPageText pageText = page.getText();
            pageText.buildWords();
            List<DocumentWord> words = pageText.getWords();

            List<TextInfo> currentPageInfos = new ArrayList<>(words.size());
            for (DocumentWord documentWord : words) {
                LeadRectD bounds = documentWord.getBounds();
                TextInfo convert = new TextInfo(documentWord.getValue(),
                        new Rectangle(
                                (float) (bounds.getX()),
                                (float) (bounds.getY()),
                                (float) (bounds.getWidth()),
                                (float) (bounds.getHeight())));

                currentPageInfos.add(convert);
            }
            ++currentPage;
            ocrResult.put(currentPage, currentPageInfos);
        }

        return ocrResult;
    }

    @Override
    public void createTxtFile(List<File> inputImages, File txtFile) {
        throw new UnsupportedOperationException();
    }

    @Override
    public void createTxtFile(List<File> inputImages, File txtFile, OcrProcessContext ocrProcessContext) {
        throw new UnsupportedOperationException();
    }
}

.NET: IOcrEngine Implementation Based on LEADTools

using System;
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Geom;
using iText.Pdfocr;
using Leadtools;
using Leadtools.Document;
using Leadtools.Ocr;
using IOcrEngine = iText.Pdfocr.IOcrEngine;
using LeadtoolsIOcrEngine = Leadtools.Ocr.IOcrEngine;


namespace iText.Pdf2Data.Ocr.Engine
{
    public class CustomEngine : IOcrEngine
    {
        public IDictionary<int, IList<TextInfo>> DoImageOcr(FileInfo input)
        {
            throw new NotImplementedException();
        }

        public IDictionary<int, IList<TextInfo>> DoImageOcr(FileInfo input, OcrProcessContext ocrProcessContext)
        {
            // Create the OCR engine
            LeadtoolsIOcrEngine ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD);
            ocrEngine.Startup(null, null, null,
                new FileInfo(PDF_2_DATA_OCR_LEADTOOLS_RUNTIME_PATH).FullName);

            // Set necessary OCR settings if needed using settingManager.SetValue()
            IOcrSettingManager settingManager = ocrEngine.SettingManager;

            // Prepare the OCR result map
            Dictionary<int, IList<TextInfo>> ocrResult = new Dictionary<int, IList<TextInfo>>();

            // Load the document
            using (LEADDocument document = DocumentFactory.LoadFromFile(input.FullName,
                       new LoadDocumentOptions { FirstPageNumber = 1, LastPageNumber = -1 }))
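            {
                // Use the engine created above for text recognition, mirroring the Java example
                document.Text.OcrEngine = ocrEngine;

                // Process each page of the document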
                int currentPage = 0;
                foreach (DocumentPage page in document.Pages)
                {
                    DocumentPageText pageText = page.GetText();
                    pageText.BuildWords();
                    IList<DocumentWord> words = pageText.Words;

                    List<TextInfo> currentPageInfos = new List<TextInfo>(words.Count);
                    foreach (DocumentWord documentWord in words)
                    {
                        LeadRectD bounds = documentWord.Bounds;
                        TextInfo convert = new TextInfo(documentWord.Value,
                            new Rectangle(
                                (float)bounds.X,
                                (float)bounds.Y,
                                (float)bounds.Width,
                                (float)bounds.Height));

                        currentPageInfos.Add(convert);
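                    }

                    // Store the page results under a 1-based page number, as in the Java example
                    ++currentPage;
                    ocrResult[currentPage] = currentPageInfos;
                }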

                return ocrResult;
            }
        }

        public void CreateTxtFile(IList<FileInfo> inputImages, FileInfo txtFile)
        {
            throw new NotImplementedException();
        }

        public void CreateTxtFile(IList<FileInfo> inputImages, FileInfo txtFile,
            OcrProcessContext ocrProcessContext)
        {
            throw new NotImplementedException();
        }
    }
}

2. (Optional) Prepare OCR Post-Processing Logic

Implementations of the IOcrEnginePostProcessor interface let you add post-processing that is independent of the main engine logic. Typical use cases include:

- adding a second engine and merging its results with those of the main engine
- adding a tagged structure to the elements detected by OCR (pdf2Data's Pdf2DataTATRPostProcessor)
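
As an illustration of the first use case, the sketch below merges two page-keyed OCR results of the shape returned by doImageOcr. It is a minimal sketch under stated assumptions, not pdf2Data's implementation: the class OcrResultMerger and its merge method are hypothetical, the generic type T stands in for pdfOCR's TextInfo, and the code does not implement IOcrEnginePostProcessor, whose method signatures are not shown on this page. The merge is a plain concatenation per page. A real merge would probably also need to drop words that both engines detected at overlapping positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of pdf2Data or iText pdfOCR.
final class OcrResultMerger {

    private OcrResultMerger() {
    }

    // Merges two page-keyed OCR results without modifying the inputs.
    // T is a stand-in for pdfOCR's TextInfo (assumption for this sketch).
    static <T> Map<Integer, List<T>> merge(Map<Integer, List<T>> primary,
                                           Map<Integer, List<T>> secondary) {
        // TreeMap keeps the pages in ascending order
        Map<Integer, List<T>> merged = new TreeMap<>();

        // Copy the primary results so the caller's lists stay untouched
        for (Map.Entry<Integer, List<T>> entry : primary.entrySet()) {
            merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }

        // Append the secondary results page by page, creating pages
        // that only the second engine produced
        for (Map.Entry<Integer, List<T>> entry : secondary.entrySet()) {
            merged.computeIfAbsent(entry.getKey(), page -> new ArrayList<>())
                    .addAll(entry.getValue());
        }
        return merged;
    }
}
```

Returning a new map rather than mutating the primary result keeps the first engine's output available for comparison or logging.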

The Java example in step 4 uses pdf2Data's Pdf2DataTATRPostProcessor.

3. Prepare Static Resources

If the engine you implemented in step 1 (Implement IOcrEngine) needs no static preparation, you can skip this step. The LEADTools engine does require preparation, and the steps differ between Java and .NET.

Java:

Download the following resources and add their paths to your environment:
- license file (LEADTOOLS_LICENSE_PATH)
- license key file (LEADTOOLS_LICENSE_KEY_PATH)
- JARs
- libs (PDF_2_DATA_OCR_LEADTOOLS_LIB_DIR)
- runtime (PDF_2_DATA_OCR_LEADTOOLS_RUNTIME_PATH). The engine example in step 1 uses this path.
Add the JAR files to your project.
Implement an initialization method. It must be invoked before your main code runs so that the environment is set up properly.

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import leadtools.LTLibrary;
import leadtools.Platform;
import leadtools.RasterSupport;

public class LEADToolsStaticInitializer {

    // Static method to initialize LEADTools resources.
    // Throws IOException if the license key file cannot be read.
    public static void initializeResources() throws IOException {
        // Read the developer key and register the license
        String developerKey = new String(Files.readAllBytes(Paths.get(LEADTOOLS_LICENSE_KEY_PATH)),
                StandardCharsets.UTF_8);
        RasterSupport.setLicense(LEADTOOLS_LICENSE_PATH, developerKey);

        File leadtoolsLib = new File(PDF_2_DATA_OCR_LEADTOOLS_LIB_DIR);
        Platform.setLibPath(leadtoolsLib.getAbsolutePath());
        Platform.loadLibrary(LTLibrary.LEADTOOLS);
        Platform.loadLibrary(LTLibrary.CODECS);
        Platform.loadLibrary(LTLibrary.DOCUMENT_WRITER);
        Platform.loadLibrary(LTLibrary.OCR);
    }
}
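
The examples on this page refer to the paths as constants without showing where they come from. The sketch below is one possible way to check them before initialization, assuming the paths are exposed as environment variables with the same names as those constants. The class LeadToolsPaths and its methods are hypothetical and not part of pdf2Data or LEADTools.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper: fails fast when a required LEADTools path is not configured.
final class LeadToolsPaths {

    // Names taken from the examples on this page; reading them from
    // environment variables is an assumption of this sketch.
    static final List<String> REQUIRED = Arrays.asList(
            "LEADTOOLS_LICENSE_PATH",
            "LEADTOOLS_LICENSE_KEY_PATH",
            "PDF_2_DATA_OCR_LEADTOOLS_LIB_DIR",
            "PDF_2_DATA_OCR_LEADTOOLS_RUNTIME_PATH");

    private LeadToolsPaths() {
    }

    // Returns the names of variables that are unset, blank,
    // or point to a path that does not exist.
    static List<String> findMissing(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()
                    || !Files.exists(Paths.get(value))) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Checks the real process environment and throws if anything is missing.
    static void requireAll() {
        List<String> missing = findMissing(System.getenv());
        if (!missing.isEmpty()) {
            throw new IllegalStateException(
                    "Missing or invalid LEADTools paths: " + missing);
        }
    }
}
```

Calling LeadToolsPaths.requireAll() before the initialization method reports every misconfigured path at once, instead of failing later inside a native LEADTools call.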

.NET:

Download the following resources and add their paths to your environment:
- license file (LEADTOOLS_LICENSE_PATH)
- license key file (LEADTOOLS_LICENSE_KEY_PATH)
- runtime (PDF_2_DATA_OCR_LEADTOOLS_RUNTIME_PATH). The engine example in step 1 uses this path.
Integrate LEADTools into your project by adding the necessary NuGet packages.
Implement an initialization method. It must be invoked before your main code runs so that the environment is set up properly.

using System;
using System.IO;
using System.Text;
using Leadtools;

namespace iText.Pdf2Data.Ocr.Engine
{
    public class LEADToolsStaticInitializer
    {
        // Static method to initialize LEADTools resources
        public static void InitializeResources()
        {
            String developerKey = File.ReadAllText(LEADTOOLS_LICENSE_KEY_PATH, Encoding.UTF8);
            RasterSupport.SetLicense(LEADTOOLS_LICENSE_PATH, developerKey);
        }
    }
}

4. Use the Custom Engine in Recognition

Note: Complete step 3 (Prepare Static Resources) before you use your custom engine for the first time.
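
If several code paths can reach the engine, a small guard can ensure that the preparation runs before first use and only once. The class below is a hypothetical sketch, not part of pdf2Data: it wraps any initialization action, such as a call to your own initialization method from step 3, and runs it on the first call only.

```java
// Hypothetical guard that runs an initialization action at most once.
final class OneTimeInitializer {

    private final Runnable action;
    private boolean done;

    OneTimeInitializer(Runnable action) {
        this.action = action;
    }

    // Runs the action on the first successful call and does nothing afterwards.
    // synchronized makes concurrent callers wait until initialization finishes.
    // If the action throws, the guard stays unset, so a later call retries.
    synchronized void ensureInitialized() {
        if (!done) {
            action.run();
            done = true;
        }
    }
}
```

A typical use would be to keep one shared instance and call ensureInitialized() immediately before creating the custom engine.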

Java:

   // Create custom engine
   CustomEngine customEngine = new CustomEngine();
   // Initialize table models for post processing
   Pdf2DataTATRPostProcessorStaticInitializer.initializeStaticModels(new File(PATH_TO_TABLE_DETECTION_MODEL),
                                                                     new File(PATH_TO_TABLE_STRUCTURE_MODEL));
   Pdf2DataTATRPostProcessor pdf2DataTATRPostProcessor = new Pdf2DataTATRPostProcessor();
   List<IOcrEnginePostProcessor> postProcessors = new ArrayList<>();
   postProcessors.add(pdf2DataTATRPostProcessor);
   // The boolean flag below indicates whether to write the tag structure.
   // The TATR post-processor generates this structure, so the flag must be true here.
   // If neither CustomEngine nor any of the provided post-processors produces a tag structure,
   // set this flag to false to improve performance.
   boolean isTaggingSupported = true;

   // Create an engine based on CustomEngine
   OcrWithPostProcessingEngine ocrWithPostProcessingEngine = new OcrWithPostProcessingEngine(customEngine,
           postProcessors, isTaggingSupported);
   // Create an instance of Pdf2DataExtractor using the template and the post-processing engine
   File template = new File(PATH_TO_TEMPLATE);
   Pdf2DataExtractor pdf2DataExtractor = Pdf2DataExtractor.create(template, ocrWithPostProcessingEngine);
   // Extract the recognition results from the image
   File image = new File(PATH_TO_IMAGE);
   RecognitionResult recognitionResults = pdf2DataExtractor.extract(image,
           RecognitionProperties.createForImageFile()).getResult();

.NET:

   // Create custom engine
   CustomEngine customEngine = new CustomEngine();
   // No post-processors are registered in this example
   IList<IOcrEnginePostProcessor> processorList = new List<IOcrEnginePostProcessor>();
   // The boolean flag below indicates whether to write the tag structure.
   // If neither CustomEngine nor any of the provided post-processors produces a tag structure,
   // set this flag to false to improve performance.
   bool isTaggingSupported = true;

   // Create an engine based on CustomEngine
   OcrWithPostProcessingEngine ocrWithPostProcessingEngine = new OcrWithPostProcessingEngine(customEngine,
       processorList, isTaggingSupported);
   // Create an instance of Pdf2DataExtractor using the template and the post-processing engine
   FileInfo template = new FileInfo(PATH_TO_TEMPLATE);
   Pdf2DataExtractor pdf2DataExtractor = Pdf2DataExtractor.Create(template, ocrWithPostProcessingEngine);
   // Extract the recognition results from the image
   FileInfo image = new FileInfo(PATH_TO_IMAGE);
   RecognitionResult recognitionResults = pdf2DataExtractor.Extract(image,
       RecognitionProperties.CreateForImageFile()).GetResult();
