OCR (Optical Character Recognition) is a text recognition technology. It extracts the text from an image or scanned document so that it can be automatically stored and indexed in a database, among other uses. As a general characteristic, this data recognition system works by applying regular expressions that match patterns within the text.
Among other features, Axional OCR facilitates data entry for the various types of documents that serve as the starting point for company-specific business processes.
Likewise, it simplifies information storage and filing, removing the need for access to the physical document in order to examine it in detail.
Axional OCR thus provides an efficient information entry system for company databases, making the integration of any structured physical document possible.
This Axional model aims, more specifically, to automatically integrate digitized supplier invoices into the ERP system database.
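As an illustration of recognition based on regular expressions, the following minimal sketch pulls three fields out of invoice text that has already been extracted from a PDF. The field names, the patterns, and the sample text are assumptions made for this example; they are not the patterns used by Axional OCR, and real invoices would need patterns tuned to each supplier layout.

```python
import re

# Hypothetical patterns, one capture group per field.
# "Number" is tried before "No" so that "Invoice Number:" is not cut short.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s+(?:Number|No\.?)\s*:\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(r"Date\s*:\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
    "total": re.compile(r"Total\s*:\s*([0-9]+(?:[.,][0-9]{2})?)", re.IGNORECASE),
}


def extract_fields(text):
    """Return field name -> first captured value, or None when the field is absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


sample = "ACME Supplies\nInvoice No: INV-2024-001\nDate: 15/03/2024\nTotal: 1250.40\n"
fields = extract_fields(sample)
print(fields)
```

Returning None for a missing field, rather than raising an error, lets a later step decide whether an incomplete document should be rejected or reviewed manually.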
1 Prerequisites: PDF generation
For the application to work correctly, documents must be digital PDF files with a text layer, that is, scanned paper documents or PDF files that have been converted into digitized text. The converted document looks exactly like the original, but its content can be recognized as searchable data. These files are easy to identify, since their text is selectable.
Nowadays it is very common to receive invoices from suppliers by e-mail, and they are very likely to be PDF files with a text layer.
When documents are not available as PDF files with a text layer, they must be converted. This conversion is a procedure external to the application. It can be done by an external provider or with a document scanning application that has OCR capabilities. For example, you can use Tesseract, an open source OCR engine. Most current printers with a scanner also include an OCR application.
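The "selectable text" check can also be approximated in code. The sketch below assumes the text of each page has already been read with whatever PDF library is available (that step is not shown), and it treats the file as having a text layer when at least one page yields a minimum number of non-blank characters. The function name and the threshold are hypothetical.

```python
def has_text_layer(page_texts, min_chars=20):
    """Heuristic: True if any page yields at least min_chars non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


# A scanned image wrapped in a PDF typically yields empty strings.
print(has_text_layer(["", "  \n"]))
print(has_text_layer(["Invoice No: INV-2024-001 Total: 1250.40"]))
```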
The process of transforming a document (either paper or scanned) into a PDF with a text layer is an external process that is outside the scope of the Axional OCR application.
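As one possible way to perform this external conversion, the sketch below calls the Tesseract command-line tool from Python. It assumes Tesseract is installed and on the PATH, and that the scan is available as an image file such as PNG or TIFF, since Tesseract reads images rather than PDF files. The helper names are hypothetical; the trailing `pdf` argument asks Tesseract to write a searchable PDF.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the CLI call; Tesseract appends '.pdf' to output_base itself."""
    return ["tesseract", str(image_path), str(output_base), "-l", language, "pdf"]


def image_to_searchable_pdf(image_path, output_dir, language="eng"):
    """Run Tesseract on one scanned image and return the path of the resulting PDF."""
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    subprocess.run(
        build_tesseract_command(image_path, output_base, language), check=True
    )
    return output_base.parent / (output_base.name + ".pdf")


print(build_tesseract_command("scan.png", "out/scan"))
```

Separating command construction from execution makes it possible to inspect or log the exact call before running it on a batch of scans.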
The process of integrating data obtained from the PDF document into the system is carried out in several consecutive stages:
- PDF Generation: external procedure (see previous section).
- PDF loading definition: the system loads PDF files using a previously defined configuration that depends on the type of document. This configuration includes the specific server folders in which loaded or processed files are placed.
- Template creation: each type of document to be processed must have an assigned template. This template is created from a prototype document of that type. For example, client invoices must have a template assigned, since these documents share a repetitive structure and contain the same data of interest for extraction. Another type of document, for example an ID/DNI, must have its own template. A new template must be configured for each different type of document.
- Data extraction: the system attempts to load data based on a loading model and to extract it based on the assigned template.
- Internal integration: processed data is transferred to a predefined internal table (destination table). This is the last step of the Axional OCR process.
The integration of processed data from the destination table into the client system is an external process that is outside the scope of the Axional OCR application.
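The stages listed above can be sketched end to end as follows. This is an illustrative model only, not the Axional OCR implementation: the `Template` and `LoadingDefinition` classes, the folder names, and the `read_text` callable (which stands in for whatever component reads the text layer of a PDF) are all assumptions, and the destination table is represented by a plain Python list.

```python
import re
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class Template:
    """Hypothetical template: one regex with a single capture group per field."""
    document_type: str
    fields: Dict[str, str]

    def extract(self, text: str) -> Dict[str, object]:
        row: Dict[str, object] = {"document_type": self.document_type}
        for name, pattern in self.fields.items():
            match = re.search(pattern, text)
            row[name] = match.group(1) if match else None
        return row


@dataclass
class LoadingDefinition:
    """Hypothetical loading configuration: server folders plus the template to apply."""
    inbox: Path
    processed: Path
    template: Template


def run_loading(
    definition: LoadingDefinition,
    read_text: Callable[[Path], str],
    destination_table: List[Dict[str, object]],
) -> int:
    """Extract each PDF in the inbox, append a row to the table, then move the file."""
    definition.processed.mkdir(parents=True, exist_ok=True)
    handled = 0
    for pdf_path in sorted(definition.inbox.glob("*.pdf")):
        row = definition.template.extract(read_text(pdf_path))
        row["source_file"] = pdf_path.name
        destination_table.append(row)
        # Moving the file marks it as processed so it is not loaded twice.
        shutil.move(str(pdf_path), str(definition.processed / pdf_path.name))
        handled += 1
    return handled
```

Keeping the template separate from the loading definition mirrors the description above: the folders and the document type are configured once, while a new template is added for each new type of document.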