Invokes the 'tesseract' command installed on the application server, which processes documents through OCR.

1 Using Tesseract

Tesseract is an open-source OCR engine, available under the Apache 2.0 license. It can be used directly from the command line or, for programmers, through an API. It supports a wide variety of languages.

Tesseract is included in many Linux distributions. The package is generally called "tesseract" or "tesseract-ocr"; search your distribution's repositories to find it. Language packages are usually available as well (again, search the repositories). If they are not, download the appropriate language package, unpack it, and copy the .traineddata file to the tessdata directory, which is probably /usr/share/tesseract-ocr/tessdata, /usr/share/tesseract/tessdata, or /usr/share/tessdata.

The executable is located at a different path on Windows and Linux:

WINDOWS: dll/tesseract-w2k/tesseract.exe
LINUX  : /usr/bin/tesseract

The .traineddata language files are located in:

WINDOWS: dll/tesseract-w2k/
LINUX:   /usr/share/tesseract-ocr/tessdata, /usr/share/tesseract/tessdata, /usr/share/tessdata

Notes

On Linux, Axional Studio uses the path /usr/share/tesseract and sets the TESSDATA_PREFIX environment variable accordingly. Make sure this directory contains the tessdata files, or create a symbolic link to them. On Windows, the language files are in the same directory as the executable.
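To check which of the directories listed above actually holds a given language file, a small script can help. The following is a minimal Python sketch, not part of Axional Studio: the function name `find_tessdata` and the candidate list are illustrative assumptions. It also assumes that TESSDATA_PREFIX may point either at the tessdata directory itself or at its parent, since this has differed between Tesseract versions.

```python
import os
from pathlib import Path

# Candidate directories taken from the Linux locations listed above.
# This list is an assumption; adjust it for your distribution.
CANDIDATE_DIRS = (
    "/usr/share/tesseract-ocr/tessdata",
    "/usr/share/tesseract/tessdata",
    "/usr/share/tessdata",
)


def find_tessdata(lang="eng", candidates=CANDIDATE_DIRS, env=None):
    """Return the first directory containing <lang>.traineddata, or None."""
    env = os.environ if env is None else env
    dirs = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        # Check both the prefix itself and a tessdata subdirectory,
        # because the meaning of the variable depends on the version.
        dirs += [Path(prefix), Path(prefix) / "tessdata"]
    dirs += [Path(d) for d in candidates]
    for directory in dirs:
        if (directory / f"{lang}.traineddata").is_file():
            return directory
    return None
```

For example, `find_tessdata("deu")` returns the directory that contains deu.traineddata, or None if the German language file is not installed in any of the checked locations.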

Tesseract is a command-line program, so first open a terminal or command prompt. The command is used as follows:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

The basic usage, to run OCR on an image called 'myscan.png' and store the result in 'out.txt', is:

tesseract myscan.png out

Or, to do the same in German:

tesseract myscan.png out -l deu
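
The same invocations can be scripted. The sketch below is one possible way to do it in Python, assuming the 'tesseract' executable is on the PATH; the helper names `build_command` and `run_ocr` are hypothetical and not part of Tesseract or Axional Studio. It assembles the argument list in the order given by the syntax line above and runs it with the standard `subprocess` module.

```python
import subprocess


def build_command(image, outputbase, lang=None, psm=None, configs=()):
    """Assemble the tesseract argument list from the syntax shown above."""
    cmd = ["tesseract", str(image), str(outputbase)]
    if lang:
        cmd += ["-l", lang]
    if psm is not None:
        # The syntax line above uses -psm; newer Tesseract releases
        # expect --psm instead, so adapt this to the installed version.
        cmd += ["-psm", str(psm)]
    cmd += list(configs)
    return cmd


def run_ocr(image, outputbase, **options):
    """Run tesseract; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(image, outputbase, **options), check=True)
```

Calling `run_ocr("myscan.png", "out", lang="deu")` is then equivalent to the German example above. Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.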

Tesseract also includes an hOCR mode, which produces a special HTML file containing the coordinates of each word. This can be used to create a searchable PDF with a tool such as Hocr2PDF. To enable it, pass the 'hocr' configuration option, like this:

tesseract myscan.png out hocr
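
To illustrate what the hOCR output contains, the following Python sketch reads the word coordinates back out of it. It assumes the usual hOCR convention in which each word is a span with class 'ocrx_word' whose title attribute carries 'bbox x0 y0 x1 y1'; the class `HocrWords` and the function `extract_words` are illustrative names, and the exact markup may vary between Tesseract versions.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)",
                              attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Inline tags inside a word (for example strong) are ignored;
        # only the closing span ends the current word.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Reading the file written by the command above and passing its contents to `extract_words` yields one entry per recognized word, which is the information a tool needs to place an invisible text layer over the scanned image.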

2 runtime.plugin.tesseract

<runtime.plugin.tesseract encoding='encoding'/>