Tesseract is an open-source OCR engine, available under the Apache 2.0 license. It can be used directly or, for programmers, through an API. It supports a wide variety of languages.
Tesseract is included in many Linux distributions. The package is generally called "tesseract" or "tesseract-ocr"; search your distribution's repositories to find it. Packages are usually available for different languages (search the repositories), but if not, you will have to download the appropriate package, unpack it and copy the .traineddata file to the tessdata directory, probably /usr/share/tesseract-ocr/tessdata, /usr/share/tesseract/tessdata, /usr/share/tessdata or /usr/bin/Tesseract.
The executable is located at a different path on Windows and Linux:
WINDOWS: dll/tesseract-w2k/tesseract.exe
LINUX: /usr/bin/tesseract
The .traineddata language files are located in:
WINDOWS: dll/tesseract-w2k/
LINUX: /usr/share/tesseract-ocr/tessdata, /usr/share/tesseract/tessdata or /usr/share/tessdata.
On Linux, Axional Studio uses the path /usr/share/Tesseract and sets the TESSDATA_PREFIX environment variable to it. Make sure that this directory contains the tessdata files, or create a symbolic link to them. On Windows, the directory is the same one where the executable resides.
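As a quick way to check whether a language is installed, the following minimal Python sketch looks for a .traineddata file in the Linux locations listed above. The function name find_traineddata and the rule of checking both TESSDATA_PREFIX and its tessdata subdirectory are assumptions made for illustration; they are not part of Tesseract or Axional Studio.

```python
import os
from pathlib import Path

# Candidate directories taken from the Linux locations listed above.
CANDIDATE_DIRS = [
    "/usr/share/tesseract-ocr/tessdata",
    "/usr/share/tesseract/tessdata",
    "/usr/share/tessdata",
]


def find_traineddata(lang, candidates=CANDIDATE_DIRS, env=None):
    """Return the Path of <lang>.traineddata, or None if it is not found."""
    env = os.environ if env is None else env
    dirs = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        # Assumption: the prefix may point at the tessdata directory itself
        # or at its parent, so both places are checked.
        dirs += [Path(prefix), Path(prefix) / "tessdata"]
    dirs += [Path(d) for d in candidates]
    for directory in dirs:
        data_file = directory / f"{lang}.traineddata"
        if data_file.is_file():
            return data_file
    return None
```

Calling find_traineddata("deu") returns the path of the German data file if one exists in any of the checked directories, and None otherwise.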
Tesseract is a command-line program, so first open a terminal or command prompt. The command is used as follows:
tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
The basic usage, to run OCR on an image called 'myscan.png' and store the result in 'out.txt', would be the following:
tesseract myscan.png out
Or, to do the same in German:
tesseract myscan.png out -l deu
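For programmers who want to call the command from a script, the syntax above can be assembled as an argument list. This is only a sketch: the helper names build_tesseract_command and run_ocr are hypothetical, and it assumes that the tesseract executable is on the PATH.

```python
import subprocess


def build_tesseract_command(image, outputbase, lang=None, psm=None, configs=()):
    """Build the argument list: tesseract imagename outputbase [-l lang] [-psm n] [configfile...]."""
    cmd = ["tesseract", str(image), str(outputbase)]
    if lang:
        cmd += ["-l", lang]
    if psm is not None:
        # The flag is spelled "-psm" as in the syntax line above; other
        # Tesseract releases may spell it differently (check `tesseract --help`).
        cmd += ["-psm", str(psm)]
    cmd += list(configs)
    return cmd


def run_ocr(image, outputbase, **options):
    """Run Tesseract and return the recognized text (plain-text output only)."""
    subprocess.run(build_tesseract_command(image, outputbase, **options), check=True)
    # Tesseract appends ".txt" to the output base for plain-text output.
    with open(f"{outputbase}.txt", encoding="utf-8") as handle:
        return handle.read()
```

For example, build_tesseract_command("myscan.png", "out", lang="deu") yields the same arguments as the German command above. Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.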
Tesseract also includes an hOCR mode, which produces a special HTML file with the coordinates of each word. This can be used to create a searchable PDF, using a tool such as Hocr2PDF. To use it, pass the 'hocr' configuration option, like this:
tesseract myscan.png out hocr
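To illustrate what the hOCR output contains, the following sketch reads the word coordinates back using only the Python standard library. It assumes that each word is a span element with class ocrx_word whose title attribute carries "bbox x0 y0 x1 y1", as the hOCR format describes. The names HocrWordParser and parse_hocr_words are hypothetical, and the name of the generated file (out.html or out.hocr) depends on the Tesseract version, so check which one is produced.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word currently open
        self._text = []     # text fragments of the word currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(value) for value in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # The first closing span after a word start ends that word.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Given the contents of the generated file as a string, parse_hocr_words returns one (word, bounding box) pair per recognized word, in pixel coordinates of the source image.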
|Aencoding||string||Encoding of the input data (ISO-8859-1, UTF-8 ...).|