Class Tesseract
The tesseract OCR engine
public class Tesseract : UnmanagedObject, IDisposable
- Inheritance
- UnmanagedObject
Tesseract
- Implements
- IDisposable
Constructors
Tesseract(bool)
Create a default tesseract engine. You need to call the Init function at a later stage to load the language files.
public Tesseract(bool enforceLocale = true)
Parameters
enforceLocale
bool: If true, it will enforce the "C" locale during the initialization.
Tesseract(string, string, OcrEngineMode, string, bool)
Create a Tesseract OCR engine.
public Tesseract(string dataPath, string language, OcrEngineMode mode, string whiteList = null, bool enforceLocale = true)
Parameters
dataPath
string: The datapath must be the name of the directory of tessdata and must end in / . Any name after the last / will be stripped.
language
string: The language is (usually) an ISO 639-3 string; null defaults to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. E.g. hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. E.g. if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words.
mode
OcrEngineMode: OCR engine mode
whiteList
string: This can be used to specify a white list for OCR, e.g. specify "1234567890" to recognize digits only. Note that the white list currently seems to work only with OcrEngineMode.OEM_TESSERACT_ONLY.
enforceLocale
bool: If true, we will change the locale to "C" before initializing the tesseract engine and revert it once the tesseract initialization is complete. If false, it is the user's responsibility to set the locale to "C", otherwise an exception will be thrown. See https://github.com/tesseract-ocr/tesseract/issues/1670
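The trailing-slash requirement on dataPath is easy to miss. The following is a minimal sketch of a helper that appends the slash when it is absent. The class name TessDataPath and the method EnsureTrailingSlash are hypothetical and not part of this API, and the sketch assumes that only a forward slash counts as a terminator.

```csharp
using System;

// Hypothetical helper, not part of the Tesseract class.
public static class TessDataPath
{
    // Returns dataPath unchanged if it already ends in '/',
    // otherwise returns it with a single '/' appended.
    public static string EnsureTrailingSlash(string dataPath)
    {
        if (string.IsNullOrEmpty(dataPath))
            throw new ArgumentException("dataPath must not be null or empty.", nameof(dataPath));

        // Assumption: only '/' is treated as the terminator, as the parameter text states.
        return dataPath[dataPath.Length - 1] == '/' ? dataPath : dataPath + "/";
    }
}
```

The result can then be passed as the dataPath argument of the constructor or of Init.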
Properties
Datapath
Get the current location of tessdata.
public string Datapath { get; }
Property Value
- string
DefaultTesseractDirectory
Get the default tesseract ocr directory. This should return the folder of the dll in most situations.
public static string DefaultTesseractDirectory { get; }
Property Value
- string
Oem
Get the OCR Engine Mode
public OcrEngineMode Oem { get; }
Property Value
- OcrEngineMode
PageSegMode
Gets or sets the page segmentation mode.
public PageSegMode PageSegMode { get; set; }
Property Value
- PageSegMode
The page segmentation mode.
Version
Get the tesseract version
public static Version Version { get; }
Property Value
- Version
VersionString
Get the tesseract version as String
public static string VersionString { get; }
Property Value
- string
Methods
AnalyseLayout(bool)
Runs page layout analysis in the mode set by the PageSegMode property. May optionally be called prior to Recognize to get access to just the page layout results. Returns an iterator to the results, or null on error or for an empty page. The returned iterator must be disposed after use. WARNING! The iterator points to data held within the underlying TessBaseAPI object, and therefore can only be used while that object still exists and has not been subjected to a call of Init, SetImage, Recognize, Clear, End, DetectOS, or anything else that changes the internal PAGE_RES.
public PageIterator AnalyseLayout(bool mergeSimilarWords = false)
Parameters
mergeSimilarWords
bool: If true, merge similar words.
Returns
- PageIterator
Page iterator
DisposeObject()
Release the unmanaged resource associated with this class
protected override void DisposeObject()
GetBoxText(int)
Get the recognized text, coded in the same format as a box file used in training.
public string GetBoxText(int pageNumber = 0)
Parameters
pageNumber
int: pageNumber is 0-based but will appear in the output as 1-based.
Returns
- string
The recognized text, coded in the same format as a box file used in training.
GetHOCRText(int)
Make an HTML-formatted string with hOCR markup from the internal data structures.
public string GetHOCRText(int pageNumber = 0)
Parameters
pageNumber
int: pageNumber is 0-based but will appear in the output as 1-based.
Returns
- string
An HTML-formatted string with hOCR markup from the internal data structures.
GetLangFileUrl(string)
Get the URL to download the tessdata file for the specified language.
public static string GetLangFileUrl(string lang)
Parameters
lang
string: The three-letter language identifier.
Returns
- string
The URL to download the tessdata file for the specified language.
GetOpenCLDevice(ref nint)
If compiled with OpenCL and an available OpenCL device is deemed faster than serial code, device is populated with the cl_device_id and the method returns sizeof(cl_device_id); otherwise device is set to a null pointer and the method returns 0.
public int GetOpenCLDevice(ref nint device)
Parameters
device
nint: Pointer to the OpenCL device.
Returns
- int
0 if no device found. sizeof(cl_device_id) if device is found.
GetOsdText(int)
Get the orientation and script detection (OSD) result as text.
public string GetOsdText(int pageNumber = 0)
Parameters
pageNumber
int: pageNumber is 0-based but will appear in the output as 1-based.
Returns
- string
The orientation and script detection (OSD) result as text.
GetTSVText(int)
Make a TSV-formatted string from the internal data structures.
public string GetTSVText(int pageNumber = 0)
Parameters
pageNumber
int: pageNumber is 0-based but will appear in the output as 1-based.
Returns
- string
A TSV-formatted string from the internal data structures.
GetUNLVText(int)
The recognized text is returned coded as UNLV format Latin-1 with specific reject and suspect codes
public string GetUNLVText(int pageNumber = 0)
Parameters
pageNumber
int: pageNumber is 0-based but will appear in the output as 1-based.
Returns
- string
The recognized text is returned coded as UNLV format Latin-1 with specific reject and suspect codes
GetUTF8Text()
Get all the text in the image
public string GetUTF8Text()
Returns
- string
All the text in the image
GetWords()
Detect all the words in the image.
public Tesseract.Word[] GetWords()
Returns
- Word[]
All the words in the image
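The methods on this page are typically combined in a fixed order: set the image, run recognition, then read the results. The following is a minimal sketch of that sequence using only members documented here. The class name OcrSketch and the method ReadText are hypothetical, and the sketch assumes the tessdata files for the requested language are already present under dataPath.

```csharp
using System;
using Emgu.CV;
using Emgu.CV.OCR;

// Hypothetical wrapper illustrating the call order; not part of the library.
public static class OcrSketch
{
    public static string ReadText(IInputArray image, string dataPath, string language, OcrEngineMode mode)
    {
        // The engine holds unmanaged resources, so dispose it when done.
        using (var ocr = new Tesseract(dataPath, language, mode))
        {
            ocr.SetImage(image);

            // Recognize returns 0 on success.
            if (ocr.Recognize() != 0)
                throw new InvalidOperationException("Recognize() reported a failure.");

            // GetWords returns one entry per detected word.
            Tesseract.Word[] words = ocr.GetWords();
            Console.WriteLine("Detected " + words.Length + " word(s).");

            return ocr.GetUTF8Text();
        }
    }
}
```

Any IInputArray can be passed as the image, for example a Mat loaded elsewhere in the application.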
Init(byte[], string, OcrEngineMode)
Initialize the OCR engine using the raw .traineddata and language name.
public void Init(byte[] rawTrainedData, string language, OcrEngineMode mode)
Parameters
rawTrainedData
byte[]: The raw trained data, e.g. for English, the contents of the eng.traineddata file.
language
string: The language is (usually) an ISO 639-3 string; null defaults to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. E.g. hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. E.g. if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words.
mode
OcrEngineMode: OCR engine mode
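The byte[] overload pairs naturally with the default constructor, which leaves language loading to a later Init call. The following is a minimal sketch of that two-step setup. The class name InMemoryInit, the method Create, and the trainedDataFile parameter are hypothetical; the sketch assumes the file is a valid .traineddata file for the given language.

```csharp
using System.IO;
using Emgu.CV.OCR;

// Hypothetical factory illustrating deferred initialization; not part of the library.
public static class InMemoryInit
{
    public static Tesseract Create(string trainedDataFile, string language, OcrEngineMode mode)
    {
        // Read the whole .traineddata file into memory.
        byte[] raw = File.ReadAllBytes(trainedDataFile);

        // Default engine; enforceLocale keeps its default value of true.
        var ocr = new Tesseract();
        try
        {
            ocr.Init(raw, language, mode);
            return ocr;
        }
        catch
        {
            // Release the unmanaged engine if initialization fails.
            ocr.Dispose();
            throw;
        }
    }
}
```

The caller owns the returned engine and should dispose it after use.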
Init(string, string, OcrEngineMode)
Initialize the OCR engine using the specified dataPath and language name.
public void Init(string dataPath, string language, OcrEngineMode mode)
Parameters
dataPath
string: The datapath must be the name of the parent directory of tessdata and must end in / . Any name after the last / will be stripped.
language
string: The language is (usually) an ISO 639-3 string; null defaults to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. E.g. hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. E.g. if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words.
mode
OcrEngineMode: OCR engine mode
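The language string format described for the language parameter can be illustrated with a small parser. The following is a sketch only: the class name LanguageSpec and the method Parse are hypothetical, the engine does its own parsing internally, and this code merely mirrors the documented grammar, where + separates languages and a leading ~ marks a language that should not be loaded.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that mirrors the documented language string grammar.
public static class LanguageSpec
{
    // Splits a spec such as "hin+~eng" into (code, excluded) pairs.
    public static List<(string Code, bool Excluded)> Parse(string spec)
    {
        var result = new List<(string Code, bool Excluded)>();

        // A null language defaults to "eng", as the parameter text states.
        if (spec == null)
        {
            result.Add(("eng", false));
            return result;
        }

        foreach (string part in spec.Split('+'))
        {
            bool excluded = part.Length > 0 && part[0] == '~';
            string code = excluded ? part.Substring(1) : part;
            if (code.Length == 0)
                throw new FormatException("Empty language code in \"" + spec + "\".");
            result.Add((code, excluded));
        }
        return result;
    }
}
```

For instance, parsing hin+~eng yields hin as a language to load and eng as a language explicitly excluded.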
IsValidWord(string)
Check whether a word is valid according to Tesseract's language model
public int IsValidWord(string word)
Parameters
word
string: The word to be checked.
Returns
- int
0 if the word is invalid, non-zero if valid
ProcessPage(Pix, int, string, string, int, ITessResultRenderer)
Turn a single image into symbolic text.
public bool ProcessPage(Pix pix, int pageIndex, string filename, string retryConfig, int timeoutMillisec, ITessResultRenderer renderer)
Parameters
pix
Pix: The image to process.
pageIndex
int: Metadata used by side-effect processes, such as reading a box file or formatting as hOCR.
filename
string: Metadata used by side-effect processes, such as reading a box file or formatting as hOCR.
retryConfig
string: Useful for debugging. If not null, you can fall back to an alternate configuration if a page fails for some reason.
timeoutMillisec
int: Terminates processing if any single page takes too long. Set to 0 for unlimited time.
renderer
ITessResultRenderer: Responsible for creating the output. For example, use the TessTextRenderer if you want plaintext output, or the TessPDFRenderer to produce searchable PDF.
Returns
- bool
Returns true if successful, false on error.
Recognize()
Recognize the image previously supplied through SetImage, generating Tesseract internal structures.
public int Recognize()
Returns
- int
Returns 0 on success.
SetImage(IInputArray)
Set the image for optical character recognition
public void SetImage(IInputArray image)
Parameters
image
IInputArray: The image on which to run recognition.
SetImage(Pix)
Set the image for optical character recognition
public void SetImage(Pix image)
Parameters
image
Pix: The image on which to run recognition.
SetVariable(string, string)
Set the named Tesseract variable to the specified value.
public void SetVariable(string variableName, string value)
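As an illustration of SetVariable, the following sketch restricts recognition to digits. The class name DigitsOnly and the method RestrictToDigits are hypothetical. The variable name tessedit_char_whitelist comes from the underlying Tesseract engine's parameter set rather than from this class, and whether it takes effect may depend on the engine mode in use.

```csharp
using Emgu.CV.OCR;

// Hypothetical helper illustrating SetVariable; not part of the library.
public static class DigitsOnly
{
    public static void RestrictToDigits(Tesseract ocr)
    {
        // Assumption: the engine honors this whitelist parameter in the current mode.
        ocr.SetVariable("tessedit_char_whitelist", "0123456789");
    }
}
```

Call it on an initialized engine before Recognize so that the setting applies to the next recognition pass.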