Class Tesseract

Namespace: Emgu.CV.OCR

Assembly: Emgu.CV.dll

The tesseract OCR engine

public class Tesseract : UnmanagedObject, IDisposable

Inheritance: object

DisposableObject

UnmanagedObject

Tesseract

Implements: IDisposable

Inherited Members: UnmanagedObject._ptr

UnmanagedObject.Ptr

DisposableObject.Dispose()

DisposableObject.ReleaseManagedResources()

object.GetType()

object.MemberwiseClone()

object.ToString()

object.Equals(object)

object.Equals(object, object)

object.ReferenceEquals(object, object)

object.GetHashCode()

Constructors

Tesseract(bool)

Create a default tesseract engine. Init must be called at a later stage to load the language files.

public Tesseract(bool enforceLocale = true)

Parameters

enforceLocale bool: If true, it will enforce "C" locale during the initialization.

Tesseract(string, string, OcrEngineMode, string, bool)

Create a Tesseract OCR engine.

public Tesseract(string dataPath, string language, OcrEngineMode mode, string whiteList = null, bool enforceLocale = true)

Parameters

dataPath string: The datapath must be the name of the directory of tessdata and must end in / . Any name after the last / will be stripped.
language string: The language is (usually) an ISO 639-3 string or NULL will default to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. E.g. hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. E.g. if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words.
mode OcrEngineMode: OCR engine mode
whiteList string: This can be used to specify a white list for OCR. e.g. specify "1234567890" to recognize digits only. Note that the white list currently seems to only work with OcrEngineMode.OEM_TESSERACT_ONLY
enforceLocale bool: If true, the locale is changed to "C" before the tesseract engine is initialized and reverted once initialization is complete. If false, it is the user's responsibility to set the locale to "C", otherwise an exception will be thrown. See https://github.com/tesseract-ocr/tesseract/issues/1670
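As a rough sketch of how this constructor might be used to build a digits-only engine: the class name TesseractSetupSketch, the helper CreateDigitEngine and its parameters are hypothetical. The engine mode is passed in rather than named, because the members of OcrEngineMode are not listed on this page.

```csharp
using Emgu.CV.OCR;

public static class TesseractSetupSketch
{
    // Hypothetical helper: builds an English engine restricted to digits.
    // tessdataDir is expected to end in '/', as the dataPath parameter requires.
    public static Tesseract CreateDigitEngine(string tessdataDir, OcrEngineMode mode)
    {
        // The white list limits recognition to the listed characters.
        // enforceLocale is left at its default (true), so the "C" locale is
        // applied while the engine initializes.
        return new Tesseract(tessdataDir, "eng", mode, "1234567890");
    }
}
```

Because Tesseract implements IDisposable, a caller would normally wrap the returned engine in a using statement.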

Properties

Datapath

Get the current location of tessdata.

public string Datapath { get; }

Property Value

string

DefaultTesseractDirectory

Get the default tesseract ocr directory. This should return the folder of the dll in most situations.

public static string DefaultTesseractDirectory { get; }

Property Value

string

Oem

Get the OCR Engine Mode

public OcrEngineMode Oem { get; }

Property Value

OcrEngineMode

PageSegMode

Gets or sets the page segmentation mode.

public PageSegMode PageSegMode { get; set; }

Property Value

PageSegMode: The page segmentation mode.

Version

Get the tesseract version

public static Version Version { get; }

Property Value

Version

VersionString

Get the tesseract version as a string

public static string VersionString { get; }

Property Value

string

Methods

AnalyseLayout(bool)

Runs page layout analysis in the mode set by SetPageSegMode. May optionally be called prior to Recognize to get access to just the page layout results. Returns an iterator to the results. Returns NULL on error or an empty page. The returned iterator must be deleted after use. WARNING! This class points to data held within the TessBaseAPI class, and therefore can only be used while the TessBaseAPI class still exists and has not been subjected to a call of Init, SetImage, Recognize, Clear, End, DetectOS, or anything else that changes the internal PAGE_RES.

public PageIterator AnalyseLayout(bool mergeSimilarWords = false)

Parameters

mergeSimilarWords bool: If true merge similar words

Returns

PageIterator: Page iterator

DisposeObject()

Release the unmanaged resource associated with this class

protected override void DisposeObject()

GetBoxText(int)

The recognized text is returned as coded in the same format as a box file used in training.

public string GetBoxText(int pageNumber = 0)

Parameters

pageNumber int: pageNumber is 0-based but will appear in the output as 1-based.

Returns

string: The recognized text is returned as coded in the same format as a box file used in training.

GetHOCRText(int)

Make a HTML-formatted string with hOCR markup from the internal data structures.

public string GetHOCRText(int pageNumber = 0)

Parameters

pageNumber int: pageNumber is 0-based but will appear in the output as 1-based.

Returns

string: A HTML-formatted string with hOCR markup from the internal data structures.

GetLangFileUrl(string)

Get the URL to download the tessdata file for the specific language

public static string GetLangFileUrl(string lang)

Parameters

lang string: The 3 letter language identifier

Returns

string: The URL to download the tessdata file for the specific language

GetOpenCLDevice(ref nint)

If compiled with OpenCL and an available OpenCL device is deemed faster than serial code, then device is populated with the cl_device_id and the method returns sizeof(cl_device_id); otherwise device is set to a null pointer and the method returns 0.

public int GetOpenCLDevice(ref nint device)

Parameters

device nint: Pointer to the opencl device

Returns

int: 0 if no device found. sizeof(cl_device_id) if device is found.
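A minimal sketch of how the return value might be checked. The class OpenClSketch and the helper HasOpenClDevice are hypothetical names introduced only for illustration.

```csharp
using Emgu.CV.OCR;

public static class OpenClSketch
{
    // Hypothetical helper: reports whether Tesseract selected an OpenCL device.
    public static bool HasOpenClDevice(Tesseract ocr)
    {
        nint device = 0;

        // A return value of 0 means no device was found; otherwise the value
        // is sizeof(cl_device_id) and device holds the cl_device_id.
        int size = ocr.GetOpenCLDevice(ref device);
        return size != 0;
    }
}
```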

GetOsdText(int)

Get the orientation and script detection (OSD) result as text

public string GetOsdText(int pageNumber = 0)

Parameters

pageNumber int: pageNumber is 0-based but will appear in the output as 1-based.

Returns

string: The orientation and script detection (OSD) result as text

GetTSVText(int)

Make a TSV-formatted string from the internal data structures.

public string GetTSVText(int pageNumber = 0)

Parameters

pageNumber int: pageNumber is 0-based but will appear in the output as 1-based.

Returns

string: A TSV-formatted string from the internal data structures.

GetUNLVText(int)

The recognized text is returned coded as UNLV format Latin-1 with specific reject and suspect codes

public string GetUNLVText(int pageNumber = 0)

Parameters

pageNumber int: pageNumber is 0-based but will appear in the output as 1-based.

Returns

string: The recognized text is returned coded as UNLV format Latin-1 with specific reject and suspect codes

GetUTF8Text()

Get all the text in the image

public string GetUTF8Text()

Returns

string: All the text in the image
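The usual call sequence can be sketched with members documented on this page only: SetImage, then Recognize, then GetUTF8Text. The class OcrPipelineSketch and the helper ReadText are hypothetical, and the engine is assumed to have been initialized already.

```csharp
using System;
using Emgu.CV;
using Emgu.CV.OCR;

public static class OcrPipelineSketch
{
    // Hypothetical helper: runs OCR on an image with an initialized engine.
    public static string ReadText(Tesseract ocr, IInputArray image)
    {
        ocr.SetImage(image);

        // Recognize() returns 0 on success.
        int status = ocr.Recognize();
        if (status != 0)
        {
            throw new InvalidOperationException(
                "Tesseract.Recognize failed with code " + status);
        }

        return ocr.GetUTF8Text();
    }
}
```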

GetWords()

Detect all the words in the image.

public Tesseract.Word[] GetWords()

Returns

Word[]: All the words in the image

Init(byte[], string, OcrEngineMode)

Initialize the OCR engine using the raw .traineddata and language name.

public void Init(byte[] rawTrainedData, string language, OcrEngineMode mode)

Parameters

rawTrainedData byte[]: The raw trained data. E.g. for English, the rawTrainedData is the contents of the eng.traineddata file.
language string: The language is (usually) an ISO 639-3 string or NULL will default to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. E.g. hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. E.g. if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words.
mode OcrEngineMode: OCR engine mode
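One way this overload might be combined with GetLangFileUrl is to fetch the trained data over HTTP and initialize the engine from memory. This is a sketch under assumptions: the class TrainedDataSketch and the helper InitFromDownloadAsync are hypothetical, and the URL returned by GetLangFileUrl is assumed to be reachable and to serve the raw .traineddata file.

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using Emgu.CV.OCR;

public static class TrainedDataSketch
{
    // Hypothetical helper: downloads <lang>.traineddata and initializes
    // the engine from the downloaded bytes.
    public static async Task InitFromDownloadAsync(
        Tesseract ocr, string lang, OcrEngineMode mode)
    {
        string url = Tesseract.GetLangFileUrl(lang);

        using (var client = new HttpClient())
        {
            // Assumption: the URL serves the raw trained data file.
            byte[] rawTrainedData = await client.GetByteArrayAsync(url);
            ocr.Init(rawTrainedData, lang, mode);
        }
    }
}
```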

Init(string, string, OcrEngineMode)

Initialize the OCR engine using the specific dataPath and language name.

public void Init(string dataPath, string language, OcrEngineMode mode)

Parameters

dataPath string: The datapath must be the name of the parent directory of tessdata and must end in / . Any name after the last / will be stripped.
language string: The language is (usually) an ISO 639-3 string or NULL will default to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. E.g. hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. E.g. if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words.
mode OcrEngineMode: OCR engine mode
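The language string format described above can be illustrated with a small helper that does not depend on Emgu CV. The class LanguageSpec and its Build method are hypothetical and are not part of the library; they only assemble the string that would be passed as the language argument.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LanguageSpec
{
    // Hypothetical helper: joins language codes into the form
    // [~]<lang>[+[~]<lang>]*. An entry with Exclude = true gets the '~' prefix.
    public static string Build(IEnumerable<(string Code, bool Exclude)> languages)
    {
        List<string> parts = languages
            .Select(l => (l.Exclude ? "~" : "") + l.Code)
            .ToList();

        if (parts.Count == 0)
        {
            throw new ArgumentException(
                "At least one language is required.", nameof(languages));
        }

        return string.Join("+", parts);
    }
}
```

For example, LanguageSpec.Build(new[] { ("hin", false), ("eng", true) }) produces hin+~eng, the form the description gives for loading only hin.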

IsValidWord(string)

Check whether a word is valid according to Tesseract's language model

public int IsValidWord(string word)

Parameters

word string: The word to be checked.

Returns

int: 0 if the word is invalid, non-zero if valid

ProcessPage(Pix, int, string, string, int, ITessResultRenderer)

Turn a single image into symbolic text.

public bool ProcessPage(Pix pix, int pageIndex, string filename, string retryConfig, int timeoutMillisec, ITessResultRenderer renderer)

Parameters

pix Pix: The pix is the image processed.
pageIndex int: Metadata used by side-effect processes, such as reading a box file or formatting as hOCR.
filename string: Metadata used by side-effect processes, such as reading a box file or formatting as hOCR.
retryConfig string: retryConfig is useful for debugging. If not NULL, you can fall back to an alternate configuration if a page fails for some reason.
timeoutMillisec int: terminates processing if any single page takes too long. Set to 0 for unlimited time.
renderer ITessResultRenderer: Responsible for creating the output. For example, use the TessTextRenderer if you want plaintext output, or the TessPDFRenderer to produce searchable PDF.

Returns

bool: Returns true if successful, false on error.

Recognize()

Recognize the image previously supplied through SetImage, generating Tesseract internal structures.

public int Recognize()

Returns

int: Returns 0 on success.

SetImage(IInputArray)

Set the image for optical character recognition

public void SetImage(IInputArray image)

Parameters

image IInputArray: The image on which to perform OCR

SetImage(Pix)

Set the image for optical character recognition

public void SetImage(Pix image)

Parameters

image Pix: The image on which to perform OCR

SetVariable(string, string)

Set the variable to the specific value.

public void SetVariable(string variableName, string value)

Parameters

variableName string: The name of the tesseract variable. e.g. use "tessedit_char_blacklist" to black list characters and "tessedit_char_whitelist" to white list characters. The full list of options can be found in the Tesseract OCR source code "tesseractclass.h"
value string: The value to be set
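A short sketch of the two variables named above. The class VariableSketch and the helpers AllowOnly and Forbid are hypothetical wrappers added for illustration; only SetVariable and the two variable names come from this page.

```csharp
using Emgu.CV.OCR;

public static class VariableSketch
{
    // Hypothetical helper: limits recognition to the given characters,
    // e.g. "0123456789" for digits only.
    public static void AllowOnly(Tesseract ocr, string characters)
    {
        ocr.SetVariable("tessedit_char_whitelist", characters);
    }

    // Hypothetical helper: excludes the given characters from recognition.
    public static void Forbid(Tesseract ocr, string characters)
    {
        ocr.SetVariable("tessedit_char_blacklist", characters);
    }
}
```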
