Class Tesseract

Namespace
Emgu.CV.OCR
Assembly
Emgu.CV.dll

The Tesseract OCR engine.

public class Tesseract : UnmanagedObject, IDisposable
Inheritance
Tesseract
Implements
IDisposable

Constructors

Tesseract(bool)

Create a default Tesseract engine. You must call Init afterwards to load the language files.

public Tesseract(bool enforceLocale = true)

Parameters

enforceLocale bool

If true, the "C" locale is enforced during initialization.
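The two-step pattern described above (construct first, load language data later) might look like the following minimal sketch. The `./tessdata/` path and the `eng` language are assumptions for illustration, and `OcrEngineMode.Default` is assumed to be an available enum member in your Emgu CV version.

```csharp
using System;
using Emgu.CV.OCR;

class DeferredInitExample
{
    static void Main()
    {
        // Assumption: a tessdata folder containing eng.traineddata sits next to the executable.
        string dataPath = "./tessdata/";

        // Create the engine without loading any language data.
        using (Tesseract ocr = new Tesseract())
        {
            // Load the language files once the path and language are known.
            ocr.Init(dataPath, "eng", OcrEngineMode.Default);

            // Datapath reports where the engine is reading tessdata from.
            Console.WriteLine(ocr.Datapath);
        }
    }
}
```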

Tesseract(string, string, OcrEngineMode, string, bool)

Create a Tesseract OCR engine.

public Tesseract(string dataPath, string language, OcrEngineMode mode, string whiteList = null, bool enforceLocale = true)

Parameters

dataPath string

The datapath must be the name of the directory of tessdata and must end in /. Any name after the last / will be stripped.

language string

The language is (usually) an ISO 639-3 string; null defaults to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change the language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]*, indicating that multiple languages are to be loaded. For example, hin+eng loads Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. For example, if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages affects both speed and accuracy, because there is more work to do to decide on the applicable language and a greater chance of hallucinating incorrect words.

mode OcrEngineMode

OCR engine mode

whiteList string

This can be used to specify a white list for OCR. For example, specify "1234567890" to recognize digits only. Note that the white list currently seems to work only with OcrEngineMode.OEM_TESSERACT_ONLY.

enforceLocale bool

If true, the locale is changed to "C" before the Tesseract engine is initialized and reverted once initialization is complete. If false, it is the user's responsibility to set the locale to "C"; otherwise an exception will be thrown. See https://github.com/tesseract-ocr/tesseract/issues/1670
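As an illustration of the language string format described for the language parameter, the sketch below builds such a string from lists of language codes. `LanguageSpec` and `Build` are hypothetical names introduced here; they are not part of Emgu CV.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Emgu CV.
public static class LanguageSpec
{
    // Joins the languages to load with '+', and appends each suppressed
    // language prefixed with '~', following the [~]<lang>[+[~]<lang>]* form.
    public static string Build(IEnumerable<string> load, IEnumerable<string> suppress = null)
    {
        IEnumerable<string> parts = load;
        if (suppress != null)
        {
            parts = parts.Concat(suppress.Select(code => "~" + code));
        }
        return string.Join("+", parts);
    }
}
```

Under this sketch, `LanguageSpec.Build(new[] { "hin", "eng" })` yields `hin+eng`, and `LanguageSpec.Build(new[] { "hin" }, new[] { "eng" })` yields `hin+~eng`. The result can be passed as the language argument.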

Properties

Datapath

Get the current location of tessdata.

public string Datapath { get; }

Property Value

string

DefaultTesseractDirectory

Get the default Tesseract OCR directory. In most situations this is the folder that contains the dll.

public static string DefaultTesseractDirectory { get; }

Property Value

string

Oem

Get the OCR Engine Mode

public OcrEngineMode Oem { get; }

Property Value

OcrEngineMode

PageSegMode

Gets or sets the page segmentation mode.

public PageSegMode PageSegMode { get; set; }

Property Value

PageSegMode

The page segmentation mode.

Version

Get the tesseract version

public static Version Version { get; }

Property Value

Version

VersionString

Get the tesseract version as String

public static string VersionString { get; }

Property Value

string

Methods

AnalyseLayout(bool)

Runs page layout analysis in the mode set by PageSegMode. May optionally be called prior to Recognize to get access to just the page layout results. Returns an iterator to the results, or null on error or an empty page. The returned iterator must be disposed after use. WARNING: the iterator points to data held within the underlying TessBaseAPI object, and therefore can only be used while that object still exists and has not been subjected to a call of Init, SetImage, Recognize, Clear, End, DetectOS, or anything else that changes the internal PAGE_RES.

public PageIterator AnalyseLayout(bool mergeSimilarWords = false)

Parameters

mergeSimilarWords bool

If true, merge similar words.

Returns

PageIterator

Page iterator

DisposeObject()

Release the unmanaged resource associated with this class

protected override void DisposeObject()

GetBoxText(int)

The recognized text is returned as coded in the same format as a box file used in training.

public string GetBoxText(int pageNumber = 0)

Parameters

pageNumber int

pageNumber is 0-based but will appear in the output as 1-based.

Returns

string

The recognized text is returned as coded in the same format as a box file used in training.

GetHOCRText(int)

Make an HTML-formatted string with hOCR markup from the internal data structures.

public string GetHOCRText(int pageNumber = 0)

Parameters

pageNumber int

pageNumber is 0-based but will appear in the output as 1-based.

Returns

string

An HTML-formatted string with hOCR markup from the internal data structures.

GetLangFileUrl(string)

Get the URL from which to download the tessdata file for the specified language.

public static string GetLangFileUrl(string lang)

Parameters

lang string

The three-letter language identifier, e.g. eng.

Returns

string

The URL from which to download the tessdata file for the specified language.
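A possible use of GetLangFileUrl is to fetch a missing language file before initializing the engine. The sketch below assumes a local folder named `tessdata`, the `eng` language, and the conventional `<lang>.traineddata` file name; all three are illustrative choices rather than requirements of the API.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Emgu.CV.OCR;

class DownloadLanguageExample
{
    static async Task Main()
    {
        string lang = "eng";
        string folder = "tessdata"; // hypothetical local folder
        string target = Path.Combine(folder, lang + ".traineddata");

        if (!File.Exists(target))
        {
            Directory.CreateDirectory(folder);
            using (HttpClient client = new HttpClient())
            {
                // Ask the wrapper where the language file lives, then download it.
                byte[] data = await client.GetByteArrayAsync(Tesseract.GetLangFileUrl(lang));
                File.WriteAllBytes(target, data);
            }
        }

        Console.WriteLine("Language file available at " + target);
    }
}
```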

GetOpenCLDevice(ref nint)

If Tesseract was compiled with OpenCL and an available OpenCL device is deemed faster than serial code, device is populated with the cl_device_id and the method returns sizeof(cl_device_id). Otherwise device is set to a null pointer and the method returns 0.

public int GetOpenCLDevice(ref nint device)

Parameters

device nint

Pointer to the OpenCL device.

Returns

int

0 if no device found. sizeof(cl_device_id) if device is found.

GetOsdText(int)

Get the orientation and script detection (OSD) result as text.

public string GetOsdText(int pageNumber = 0)

Parameters

pageNumber int

pageNumber is 0-based but will appear in the output as 1-based.

Returns

string

The orientation and script detection (OSD) result as text.

GetTSVText(int)

Make a TSV-formatted string from the internal data structures.

public string GetTSVText(int pageNumber = 0)

Parameters

pageNumber int

pageNumber is 0-based but will appear in the output as 1-based.

Returns

string

A TSV-formatted string from the internal data structures.
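The TSV string can be split into rows and columns with plain string handling. The sketch below assumes the column order used by upstream Tesseract's TSV output (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) and assumes that level 5 marks word rows; check both against the output of your version. `TsvWord` and `TsvParser` are hypothetical names, not part of Emgu CV.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical container for one word-level TSV row.
public sealed class TsvWord
{
    public int Left;
    public int Top;
    public int Width;
    public int Height;
    public double Confidence;
    public string Text;
}

// Hypothetical parser, not part of Emgu CV.
public static class TsvParser
{
    // Assumed columns: level, page_num, block_num, par_num, line_num, word_num,
    // left, top, width, height, conf, text.
    public static List<TsvWord> ParseWords(string tsv)
    {
        var words = new List<TsvWord>();
        foreach (string line in tsv.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] cols = line.TrimEnd('\r').Split('\t');

            // Skip a header row, malformed rows, and rows that are not word level.
            if (cols.Length < 12 || cols[0] != "5")
            {
                continue;
            }

            words.Add(new TsvWord
            {
                Left = int.Parse(cols[6], CultureInfo.InvariantCulture),
                Top = int.Parse(cols[7], CultureInfo.InvariantCulture),
                Width = int.Parse(cols[8], CultureInfo.InvariantCulture),
                Height = int.Parse(cols[9], CultureInfo.InvariantCulture),
                Confidence = double.Parse(cols[10], CultureInfo.InvariantCulture),
                Text = cols[11]
            });
        }
        return words;
    }
}
```

A call such as `TsvParser.ParseWords(ocr.GetTSVText())` would then give one entry per recognized word, with its bounding box and confidence.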

GetUNLVText(int)

The recognized text is returned in UNLV format: Latin-1 with specific reject and suspect codes.

public string GetUNLVText(int pageNumber = 0)

Parameters

pageNumber int

pageNumber is 0-based but will appear in the output as 1-based.

Returns

string

The recognized text in UNLV format: Latin-1 with specific reject and suspect codes.

GetUTF8Text()

Get all the text in the image as a single string.

public string GetUTF8Text()

Returns

string

All the text in the image
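A typical way to reach GetUTF8Text is to set an image, run recognition, and then read the text. The sketch below assumes a `./tessdata/` folder containing `eng.traineddata` and an input file named `page.png`; both are placeholders, and `OcrEngineMode.Default` is assumed to be available in your Emgu CV version.

```csharp
using System;
using Emgu.CV;
using Emgu.CV.OCR;

class RecognizeTextExample
{
    static void Main()
    {
        // Assumptions: ./tessdata/ holds eng.traineddata and page.png is a readable image.
        using (Tesseract ocr = new Tesseract("./tessdata/", "eng", OcrEngineMode.Default))
        using (Mat image = CvInvoke.Imread("page.png"))
        {
            // Hand the image to the engine, then run recognition.
            ocr.SetImage(image);
            if (ocr.Recognize() != 0)
            {
                Console.Error.WriteLine("Recognition failed.");
                return;
            }

            // Read back everything that was recognized.
            Console.WriteLine(ocr.GetUTF8Text());
        }
    }
}
```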

GetWords()

Detect all the words in the image.

public Tesseract.Word[] GetWords()

Returns

Word[]

All the words in the image

Init(byte[], string, OcrEngineMode)

Initialize the OCR engine using the raw .traineddata and language name.

public void Init(byte[] rawTrainedData, string language, OcrEngineMode mode)

Parameters

rawTrainedData byte[]

The raw trained data. For example, for English, rawTrainedData is the contents of the eng.traineddata file.

language string

The language is (usually) an ISO 639-3 string; null defaults to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change the language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]*, indicating that multiple languages are to be loaded. For example, hin+eng loads Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. For example, if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages affects both speed and accuracy, because there is more work to do to decide on the applicable language and a greater chance of hallucinating incorrect words.

mode OcrEngineMode

OCR engine mode

Init(string, string, OcrEngineMode)

Initialize the OCR engine using the specific dataPath and language name.

public void Init(string dataPath, string language, OcrEngineMode mode)

Parameters

dataPath string

The datapath must be the name of the parent directory of tessdata and must end in /. Any name after the last / will be stripped.

language string

The language is (usually) an ISO 639-3 string; null defaults to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change the language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]*, indicating that multiple languages are to be loaded. For example, hin+eng loads Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. For example, if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages affects both speed and accuracy, because there is more work to do to decide on the applicable language and a greater chance of hallucinating incorrect words.

mode OcrEngineMode

OCR engine mode

IsValidWord(string)

Check whether a word is valid according to Tesseract's language model

public int IsValidWord(string word)

Parameters

word string

The word to be checked.

Returns

int

0 if the word is invalid, non-zero if valid

ProcessPage(Pix, int, string, string, int, ITessResultRenderer)

Turn a single image into symbolic text.

public bool ProcessPage(Pix pix, int pageIndex, string filename, string retryConfig, int timeoutMillisec, ITessResultRenderer renderer)

Parameters

pix Pix

The image to process.

pageIndex int

Metadata used by side-effect processes, such as reading a box file or formatting as hOCR.

filename string

Metadata used by side-effect processes, such as reading a box file or formatting as hOCR.

retryConfig string

retryConfig is useful for debugging. If not NULL, you can fall back to an alternate configuration if a page fails for some reason.

timeoutMillisec int

Terminates processing if any single page takes too long. Set to 0 for unlimited time.

renderer ITessResultRenderer

Responsible for creating the output. For example, use the TessTextRenderer if you want plain-text output, or the TessPDFRenderer to produce a searchable PDF.

Returns

bool

Returns true if successful, false on error.

Recognize()

Recognize the image previously supplied through SetImage, generating Tesseract's internal structures.

public int Recognize()

Returns

int

Returns 0 on success.

SetImage(IInputArray)

Set the image for optical character recognition.

public void SetImage(IInputArray image)

Parameters

image IInputArray

The image on which OCR will be performed.

SetImage(Pix)

Set the image for optical character recognition.

public void SetImage(Pix image)

Parameters

image Pix

The image on which OCR will be performed.

SetVariable(string, string)

Set the variable to the specific value.

public void SetVariable(string variableName, string value)

Parameters

variableName string

The name of the Tesseract variable. For example, use "tessedit_char_blacklist" to blacklist characters and "tessedit_char_whitelist" to whitelist characters. The full list of options can be found in "tesseractclass.h" in the Tesseract OCR source code.

value string

The value to be set
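As a small illustration of SetVariable, the sketch below restricts an already initialized engine to digits through the "tessedit_char_whitelist" variable named above. `DigitsOnly` and `Apply` are hypothetical names, and whether the whitelist takes effect may depend on the engine mode in use.

```csharp
using Emgu.CV.OCR;

// Hypothetical helper, not part of Emgu CV.
public static class DigitsOnly
{
    // Restricts recognition on an already initialized engine to the digits 0-9.
    public static void Apply(Tesseract ocr)
    {
        ocr.SetVariable("tessedit_char_whitelist", "0123456789");
    }
}
```

Calling `DigitsOnly.Apply(ocr)` before SetImage and Recognize keeps the restriction in place for the pages that follow.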