Welcome to the Tesseract.Net SDK

Tesseract.Net SDK is a class library based on the tesseract-ocr project. It can read a wide variety of image formats and convert them to text in over 60 languages.

Compatibility

Tesseract.Net SDK is available for .NET Framework 2.0 through 4.5 on 32-bit and 64-bit operating systems.

The SDK has been tested with Windows XP, Vista, 7, 8, 8.1 and 10, and is fully compatible with all of them. The native tesseract.dll library included in this SDK is supplied in both 32-bit and 64-bit versions, so your .NET application can target "Any CPU".

Getting Started

Download the installation package, unpack it, and copy the following files into your project directory:

netXX\
    Patagames.Ocr.dll
    Patagames.Ocr.xml
x64\
    tesseract.dll
x86\
    tesseract.dll
tessdata\
    configs\
    eng.*
    pdf.ttf
    pdf.ttx

Note
The eng.* files correspond to the English language, which is supplied in the standard package. If you need other languages, download them separately here and put them into the tessdata folder. If you need to configure the engine via configuration files, put them into the configs folder. Otherwise, this folder is not needed and you can remove it.

To use the library, first add a reference in your project to the Patagames.Ocr.dll file that matches your framework version.

Note
The Patagames.Ocr.xml file contains the XML documentation of the API. As long as it is in the same folder as Patagames.Ocr.dll, Visual Studio displays hints for API functions. You don’t need to distribute this file.

After you've added the above reference, you need to add the following files to your project:

x86\tesseract.dll is the 32-bit version of the Tesseract library;
x64\tesseract.dll is the 64-bit version of the Tesseract library;
the tessdata folder contains the files required for the Tesseract engine to work.

Note
If your application is 32-bit only or 64-bit only, you can remove the DLL that won't be used. You then have two options for the remaining DLL: leave it in the x86 or x64 directory, or move it to the root of your project. The library will find the DLL in both cases.

When you build your project, the tesseract.dll library (or libraries) must be placed next to your application, either in the root directory or in the x86 or x64 subdirectory. The tessdata folder must also be placed next to your application, in the root directory. The easiest way to accomplish this is to set the Copy to Output Directory property of those files to Copy always.
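If you prefer to edit the project file directly, the Copy always setting corresponds to the CopyToOutputDirectory item metadata in MSBuild. The fragment below is a minimal sketch, assuming the files sit in the project under the folder names listed above and are included as Content items; your project may use a different item type or only one of the two DLLs.

```xml
<ItemGroup>
  <!-- Native engine DLLs: the relative path (x86\ or x64\) is preserved in the output directory -->
  <Content Include="x86\tesseract.dll">
    <CopyToOutputDirectory>Always</CopyToOutputDirectory>
  </Content>
  <Content Include="x64\tesseract.dll">
    <CopyToOutputDirectory>Always</CopyToOutputDirectory>
  </Content>
  <!-- Language data and optional configs: the wildcard picks up everything under tessdata -->
  <Content Include="tessdata\**\*">
    <CopyToOutputDirectory>Always</CopyToOutputDirectory>
  </Content>
</ItemGroup>
```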

That's all. Your project is ready to use Tesseract.Net SDK.

Now you can create an instance of the OcrApi class:

using Patagames.Ocr;
...
var api = OcrApi.Create();

Important
If creating the instance fails (for example, with an error such as "Unable to load DLL 'tesseract.dll'"), try specifying the full path to tesseract.dll through the OcrApi.PathToEngine property. Read here for more details.
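The workaround described in the note above could look like the sketch below. It assumes that OcrApi.PathToEngine is a static string property that has to be set before OcrApi.Create() is called, and that the DLLs were copied to the x86 and x64 subdirectories next to the application. The OcrBootstrap class and its methods are hypothetical helper names, not part of the SDK.

```csharp
using System;
using System.IO;
using Patagames.Ocr;

// Hypothetical helper, not part of the SDK.
public static class OcrBootstrap
{
    // Builds the full path to the native DLL that matches the bitness of the current process.
    public static string GetEnginePath()
    {
        string arch = IntPtr.Size == 8 ? "x64" : "x86";
        string baseDir = AppDomain.CurrentDomain.BaseDirectory;
        // Nested two-argument Path.Combine calls also compile on .NET Framework 2.0.
        return Path.Combine(Path.Combine(baseDir, arch), "tesseract.dll");
    }

    public static OcrApi CreateApi()
    {
        // Assumption: PathToEngine is static and is read when Create() loads the engine.
        OcrApi.PathToEngine = GetEnginePath();
        return OcrApi.Create();
    }
}
```

Choosing the folder from IntPtr.Size keeps an "Any CPU" build working on both 32-bit and 64-bit systems without hard-coding one path.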

Using Tesseract.Net SDK in your app

Adding OCR functionality to your app using Tesseract.Net SDK is easy. The main class, which encapsulates the high-level API of the library, is OcrApi. The OcrResultRenderer class and its subclasses translate the recognition result into specific output formats, including PDF, HTML and others. Low-level functions that let you work with individual paragraphs, words, letters and font parameters are implemented in the OcrResultIterator class.

You can find the detailed class diagram here.

Here is typical C# code that demonstrates how to extract plain text from an image. We create an instance of the OcrApi class, initialize it for the English language, and then simply get the text from the image.

public void ExtractTextFromImage()
{
    using (var api = OcrApi.Create())
    {
        api.Init(Languages.English);
        string plainText = api.GetTextFromImage(@"c:\test_image.png");
    }
}

The GetTextFromImage method works not only with image files; it can also recognize text in a given bitmap, for instance a System.Drawing.Bitmap. Here is an example of such use:

public void ExtractTextFromBitmap()
{
    using (var api = OcrApi.Create())
    {
        api.Init(Languages.English);
        using (var bmp = Bitmap.FromFile(@"c:\test_image.png") as Bitmap)
        {
            string plainText = api.GetTextFromImage(bmp);
        }
    }
}

If you compile the above examples and feed the following image to the app:

it produces the following plain text result:

“There's no such thing
as being too busy.
If you really want something,
you'll make time for it.”

However, we are usually more interested in creating a searchable PDF from those scanned images than in plain text. You need just a tiny modification of the above code to make it produce a PDF instead:

public void Tiff2Pdf()
{
    using (var api = OcrApi.Create())
    {
        api.Init(Languages.English);
        //Create the renderer to PDF file output. The extension will be added automatically
        using (var renderer = OcrPdfRenderer.Create("multipage_pdf_file"))
        {
            renderer.BeginDocument("Title");
            api.ProcessPages(@"c:\multipage.tif", renderer);
            renderer.EndDocument();
        }
    }
}

Here, we create a PDF renderer and make the API process the pages of the source image (multipage.tif) and pass the recognition result to the renderer. As a result, the multipage_pdf_file.pdf file is created. The file contains the pages of the scanned image represented as text that you can copy and search.

If you provide this multipage TIFF to the code above, it produces the following PDF document.

You don’t have to convert images to a multipage TIFF before building a PDF file with recognized text. Even if you have a set of individual images, you can still merge them all into one PDF file. To do this, supply a plain text file to the ProcessPages method instead of the TIFF. This plain text file (scanned.txt in this example) should contain the filenames of all individual images, one per line:

C:\page1.jpg
c:\page2.png
c:\page3.bmp

Modify the code as shown below and you will get a 3-page searchable PDF file from these 3 individual images.

public void MultiplyImages2Pdf()
{
    using (var api = OcrApi.Create())
    {
        api.Init(Languages.English);
        //Create the renderer to PDF file output. The extension will be added automatically
        using (var renderer = OcrPdfRenderer.Create("output_pdf_file"))
        {
            renderer.BeginDocument("Title");
            api.ProcessPages(@"c:\scanned.txt", renderer);
            renderer.EndDocument();
        }
    }
}
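
The list file used above does not have to be written by hand. The sketch below generates it from a folder of images using only standard .NET classes. PageListBuilder and WriteList are hypothetical names, the set of extensions is an assumption, and it is also assumed that the pages of the PDF follow the order of the lines in the list file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the SDK.
public static class PageListBuilder
{
    // Writes the full paths of all images in a folder to a list file, one per line.
    public static void WriteList(string imageFolder, string listFile)
    {
        // Assumed set of image extensions; adjust it to the formats you actually scan.
        string[] extensions = { ".jpg", ".png", ".bmp", ".tif" };
        List<string> pages = new List<string>();
        foreach (string path in Directory.GetFiles(imageFolder))
        {
            string ext = Path.GetExtension(path).ToLowerInvariant();
            if (Array.IndexOf(extensions, ext) >= 0)
                pages.Add(Path.GetFullPath(path));
        }
        // Ordinal sort: "page10" comes before "page2", so zero-pad page numbers if order matters.
        pages.Sort(StringComparer.OrdinalIgnoreCase);
        File.WriteAllLines(listFile, pages.ToArray());
    }
}
```

For example, PageListBuilder.WriteList(@"c:\scans", @"c:\scanned.txt") would produce a file that can be passed to ProcessPages as shown above.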

Other Resources

Version History