1 External integration

This section explains how to configure the way documents enter the system so that the Axional OCR application fits your specific needs.

 

1.1 Creating a Type of document

First we define a new type of document that the system will process and integrate into our database. This means assigning it a name and selecting the data fields that the template should look for.

Some of these fields are mandatory; others carry extra information that may be useful but is not essential. Later we will use the template form to define where the system can find these fields in this specific type of document.

  • Access Data Extraction/Management of templates/Type of Documents.
  • Press Execute/Run icon without specifying any field.
  • Reset form data with Clear button.
  • Define code and name in capital letters.
  • Define the fields that the system will look for and mark each one as required (mandatory) or not.
  • Press INSERT icon to save the type of document.

  • Code*: unique name in uppercase
  • Name*: description text in uppercase
  • Integration: name of the Integration Configuration assigned to this type of document. IMPORTANT: this field is mandatory to perform transfer processes, but the system allows not filling it in for practical reasons.
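
The rules above (uppercase code and name, fields flagged as required or optional, and an Integration that may be left empty but is needed for transfers) can be pictured with a minimal sketch. The `DocumentType` class and its method names are hypothetical and only illustrate the rules; they are not part of Axional OCR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DocumentType:
    """Hypothetical in-memory model of a Type of document record."""
    code: str                                   # unique name, uppercase
    name: str                                   # description, uppercase
    fields: Dict[str, bool] = field(default_factory=dict)  # field -> required?
    integration: Optional[str] = None           # may be empty at creation time

    def validate(self) -> List[str]:
        """Return a list of problems; the list is empty when all is fine."""
        problems = []
        if not self.code or self.code != self.code.upper():
            problems.append("code must be non-empty and uppercase")
        if not self.name or self.name != self.name.upper():
            problems.append("name must be non-empty and uppercase")
        return problems

    def can_transfer(self) -> bool:
        # Integration is optional in the form but required for transfers.
        return self.integration is not None

    def missing_required(self, extracted: Dict[str, str]) -> List[str]:
        """Names of mandatory fields that an extraction did not fill."""
        return [name for name, required in self.fields.items()
                if required and not extracted.get(name)]


invoice = DocumentType("INVOICE", "SUPPLIER INVOICE",
                       {"PROVIDER_CIF": True, "NOTES": False})
print(invoice.validate())                          # []
print(invoice.can_transfer())                      # False
print(invoice.missing_required({"NOTES": "n/a"}))  # ['PROVIDER_CIF']
```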
 

1.2 Uploading configuration

The system uploads files for the type of document defined above according to certain criteria, for example whether a document of this type always has a single page or may have several.

Suppose we have defined that the document has only one page: when a PDF file is uploaded, the system will create as many independent documents as the file has pages.

Access External integration/Load settings and create a new configuration. Click the Clear button and fill in the following fields:

Configuration:
  • Source code*: name of the new uploading configuration; unique and uppercase.
  • Description of origin: free text description.
  • Folder file entry: folder on the server from which the system loads the documents to process. As soon as a file has been processed, it is moved to the Processed folder, which makes it easy to see which files have already been transferred.
  • Processed folder*: folder on server where the system will transfer the processed documents.
  • Status*: active or disabled.
Generation:
  • Type of document*: code of documents that correspond to this configuration.
  • Type of splitter*: instructs the system on how to divide the files to generate individual documents in order to apply corresponding templates. This is useful because files to be processed can have more than one document (for instance, more than one invoice). Possible options are:
    • Do not split.
    • White page: there is a white page between documents.
    • At the mark: instructs the system on how to divide files if there is a mark on them. This mark has to be an identifying mark, such as a company unique text, a readable barcode or a readable stamp. When this option is selected it is necessary to define Mark partition and Pattern to apply.
    • Each page.
    • Each 2/3/4 pages.
  • Only when At the mark option has been previously selected:
    • Mark partition*: indicates where this mark is located within the document. If the mark is located on the first page, it indicates the beginning of a new document. Conversely, if the mark is located on the last page, it indicates the end of the document.
    • Pattern to apply: text or regular expression to identify split point.

Finally, press the INSERT icon to save the uploading configuration.
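
To make the splitter types more concrete, here is a minimal sketch of how a list of page texts could be grouped into documents. The function `split_pages`, its parameter names and the option keywords (`"none"`, `"white_page"`, `"mark"`, `"each"`) are hypothetical; the sketch also assumes that a white separator page is discarded, which the product may handle differently.

```python
import re
from typing import List, Optional


def split_pages(pages: List[str], splitter: str, every: int = 1,
                pattern: Optional[str] = None,
                mark_on: str = "first") -> List[List[str]]:
    """Group page texts into documents according to a splitter type.

    splitter: "none", "white_page", "mark" or "each" (every `every` pages).
    mark_on:  "first" if the mark opens a document, "last" if it closes one.
    """
    if splitter == "none":
        return [pages] if pages else []
    if splitter == "each":
        return [pages[i:i + every] for i in range(0, len(pages), every)]
    if splitter not in ("white_page", "mark"):
        raise ValueError(f"unknown splitter: {splitter}")
    if splitter == "mark" and pattern is None:
        raise ValueError("the 'mark' splitter needs a pattern")

    documents, current = [], []
    for text in pages:
        if splitter == "white_page":
            if text.strip() == "":           # separator page (assumed dropped)
                if current:
                    documents.append(current)
                current = []
            else:
                current.append(text)
        else:                                # splitter == "mark"
            hit = re.search(pattern, text) is not None
            if hit and mark_on == "first" and current:
                documents.append(current)    # the mark starts a new document
                current = []
            current.append(text)
            if hit and mark_on == "last":
                documents.append(current)    # the mark ends the document
                current = []
    if current:
        documents.append(current)
    return documents


pages = ["INVOICE 1", "detail", "INVOICE 2", "detail a", "detail b"]
print(len(split_pages(pages, "mark", pattern=r"INVOICE \d+")))  # 2
print(len(split_pages(pages, "each", every=2)))                 # 3
```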

 

1.3 Automatically assign templates to documents

Once the document has been split, the system uses regular expressions to identify documents and assigns them the corresponding template.

Regular expressions provide a very flexible way to search or recognize text strings within the document.

For example, use your supplier's ID/NIF to identify their invoices easily. This way the supplier's invoice template will be assigned to the document automatically.

This feature is defined in the template, so it is configured later, while editing the template.

See section Regular expressions to assign templates.

 

2 Data extraction

To extract data, Axional OCR relies on extraction templates, usually one for each type of document. Creating a template that covers all possible cases of the data extraction is the key to the proper functioning of the process.

 

2.1 Creating a new template

To create a new template, follow these steps:

  1. Load prototype document
  2. Process prototype document
  3. Save template
  4. Edit and validate template
 

2.1.1 Loading prototype document to create a Template

To create a new template it is necessary to load a valid document as a prototype.

  • Access the External Integration/Repository of OCR Documents.
  • Reset form data with Clear button.
  • Insert your load code in the Load code field, previously created on External integration/Load settings.
  • Insert a new Source name on Source field.
  • On the File data tab, load the prototype file with the CHOOSE FILE button. This is a mandatory step.
  • Press the INSERT icon: a message !Document is not processed will appear on the document edit screen. The STATUS is now NEW.

Every time a file is uploaded, the system assigns it an ID BATCH number.

In the Repository, uploaded documents can have three states: new (uploaded but not split), processed (split using the splitter) and canceled (split canceled by the user).

 

2.1.2 Processing prototype document

  • Press the Process button at the bottom of the screen: the system will process the document. The STATUS is now PROCESSED and the file is on the server.
  • Press the Process documents button at the bottom of the screen: the system will process the document. The STATUS is now PROCESSED and the data is accessible.
  • PROTOTYPE REVIEW. Sometimes the prototype has more than one page or more than one document and needs some modifications. Use the buttons on the right to make these changes:
    • Joins current page with the previous one in a single document.
    • Joins current page with the next one in a single document.
    • Adds a new document partition between pages.
    • Removes/restores page from template. Removed pages can be seen on Repository OCR, but not on final documents.
    • Marks page as an attachment.

During this process all steps can be undone with the Cancel button.

 

2.1.3 Saving the new template

Now the document is ready to become a template.

  • Press the edit icon to create the template, since no existing template fits this type of document (use Search for a template only if one has been created previously).
  • A new window shows two options: enter an existing template to update it or, in our case, ASSIGN NEW TEMPLATE. Then enter the name and description of the new template.

The new template has been created, but the system describes it as incomplete because no data fields have been located on the template yet. See the next section to EDIT TEMPLATE.

During this process all steps can be undone with the Cancel button.

 

2.1.4 Editing an existing template

Although we have created our template from a prototype document, it now has to be edited and configured according to the data that we want to label and extract.

The EDIT TEMPLATE form is the template wizard that will lead us through a series of well-defined steps to configure our template. This is the recommended way to edit templates, although you can also use the TEMPLATE form, which is recommended only for experts or to make small modifications.

There are two ways to access the EDIT TEMPLATE form:

  1. Access Data Extraction/Management Templates/Template edit (prerequisite: the prototype document for this template has been uploaded and processed).
  2. After processing the prototype document, click the name of the newly created template (see previous section).

We have to instruct the system to find on the document each field that appears on the left side (keep in mind that these fields have been previously defined on the TYPE OF DOCUMENT form).

  • Select one field (grey indicates that it has not been configured yet) and press the icon to add a new search expression.
  • The mouse cursor now changes into a selection pointer. Select the area that contains the LABEL and VALUE of the desired FIELD.
  • A new window shows the codes found within the selection: drag and drop each code to the corresponding label or value area (inside each area the order is not important).
  • IMPORTANT: Check the USE RECTANGULAR SELECTION box if the same label may appear in another area of the document and produce an undesired result. If this option is not checked, the label will be searched for anywhere in the document.
  • Press ADD field when the selection is finished.

The template wizard has generated settings from our selection, which can now be modified. Note that when a field is selected, its label area turns grey and its value area yellow.

  • Select and edit the previous field by clicking on the edit icon: a new window will appear.

New options appear that can be easily modified to accommodate variations in the label expression or in the position and size of the value.

  • Selection zone X initial: in relation to USE RECTANGULAR SELECTION BOX.
  • Selection zone X end: in relation to USE RECTANGULAR SELECTION BOX.
  • Selection zone Y initial: in relation to USE RECTANGULAR SELECTION BOX.
  • Selection zone Y end: in relation to USE RECTANGULAR SELECTION BOX.
  • Regular Expression: the system has automatically created the expression of selected label.
  • Offset rows: distance in rows of the value from label.
  • Offset cols: distance in columns of the value from label.
  • Num rows: number of rows occupied by label.
  • Num cols: number of columns occupied by label.
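
The way a label expression and the offsets could combine can be sketched as follows. This is only an illustration under stated assumptions: the page is treated as a list of text rows, offsets are counted from the start of the matched label, and the row/column counts are taken as the size of the area that is read. The function `extract_value` and its parameters are hypothetical, and the product may interpret these settings differently.

```python
import re
from typing import List, Optional


def extract_value(lines: List[str], label_regex: str,
                  offset_rows: int, offset_cols: int,
                  num_rows: int, num_cols: int) -> Optional[str]:
    """Find the label, then read a block of text relative to its position."""
    for row, line in enumerate(lines):
        match = re.search(label_regex, line)
        if not match:
            continue
        top = row + offset_rows               # first row of the area to read
        left = match.start() + offset_cols    # first column of the area to read
        block = [text[left:left + num_cols].strip()
                 for text in lines[top:top + num_rows]]
        return " ".join(part for part in block if part)
    return None                               # label not found on the page


page = [
    "Invoice No:   A-1234     ",
    "Customer:",
    "  John Smith",
]
# Label and value on the same row, value 14 columns to the right.
print(extract_value(page, r"Invoice No:", 0, 14, 1, 10))   # A-1234
# Value one row below the label, read with a generous width.
print(extract_value(page, r"Customer:", 1, 0, 1, 20))      # John Smith
```

Widening the column count in the second call is the kind of adjustment that keeps longer values from being cut off.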
 

Some useful examples

The previous example is the simplest case, because label and value are on the same row and both always have the same length and occupy the same space. More cases follow.

Case 1 Label and value are on different rows and with fixed length

The template is divided into imaginary rows and columns, which is useful when the label and the value are located on different rows. In this case, the template wizard has automatically detected the value area, and there is no risk that the length of the value varies. No modification is needed.

Case 2 Label and value are on different rows and occupy variable length

In this case, the template wizard has automatically detected the value area, but we will increase this area to prevent longer names from being extracted incorrectly.

Case 3 Label is repeated in some part of the text

To avoid incorrect readings, the extraction is restricted to the exact area of the sheet: Use rectangular selection has been checked during template editing.
This is only useful when the same label may appear in another area of the document and produce an undesired result.

Case 4 Optional labels to identify one field

It is possible to use several different labels to identify a single field. This is useful, for instance, when documents may arrive in two languages: the invoice can be loaded in either language, as all field labels have been defined in both.

   

Regular expressions to assign templates

Now that the new template has been created, we can add regular expressions that identify documents automatically, so that the appropriate template is assigned after the loaded files have been split.

You can use one or more types of expression: the system tries the first expression on each document and stops as soon as recognition succeeds. If no recognition is made, the system tries the second expression, and so on. If no expression is found, no template is assigned and the document cannot be processed.

  • Access Document templates through the Template Tab or the Menu.
  • Fill in one or more fields from EXPRESSION TO IDENTIFY THE TEMPLATE block.
    • Expression to extract the ID: regular expression that can be shared between templates (used to optimize performance).
      The Group of extraction ID indicates which block/group of the regular expression contains the ID.
      The system then checks whether this text matches the identification declared in the Document Id field.
    • Index of expression: the identifier is a simple text to find in the document data, for example the name of a company.
    • Regexp expression: the identifier is a single regular expression with no group. Use a regular expression instead of Index of expression when the ID includes variable parts.
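
The identification logic described above can be sketched in a few lines. The `TemplateRule` class, its attribute names and the order in which the three kinds of expression are tried are assumptions made for illustration; only the general idea (try expressions in turn, stop at the first recognition, assign nothing if none matches) comes from the text.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TemplateRule:
    """Hypothetical view of one template's identification settings."""
    template: str
    id_expression: Optional[str] = None   # regex that may be shared by templates
    id_group: int = 0                     # group of id_expression holding the ID
    document_id: Optional[str] = None     # ID expected by this template
    index_text: Optional[str] = None      # plain text to find in the document
    regexp: Optional[str] = None          # single regex without groups


def matches(rule: TemplateRule, text: str) -> bool:
    # 1. Shared expression: extract the ID and compare it with Document Id.
    if rule.id_expression:
        found = re.search(rule.id_expression, text)
        if found and found.group(rule.id_group) == rule.document_id:
            return True
    # 2. Plain text lookup.
    if rule.index_text and rule.index_text in text:
        return True
    # 3. Single regular expression.
    if rule.regexp and re.search(rule.regexp, text):
        return True
    return False


def assign_template(rules: List[TemplateRule], text: str) -> Optional[str]:
    """Return the first template whose identification succeeds, else None."""
    for rule in rules:
        if matches(rule, text):
            return rule.template
    return None


rules = [
    TemplateRule("ACME_INVOICE", id_expression=r"(NIF:\s*)([\d.]+)",
                 id_group=2, document_id="46.777.888"),
    TemplateRule("GLOBEX_INVOICE", index_text="Globex Corporation"),
]
print(assign_template(rules, "Supplier NIF: 46.777.888 S"))     # ACME_INVOICE
print(assign_template(rules, "Globex Corporation, invoice 7"))  # GLOBEX_INVOICE
print(assign_template(rules, "Unknown sender"))                 # None
```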

The following example uses a regular expression to identify one supplier by its ID/NIF number. This supplier, however, writes it in three different formats in its documents:

NIF:46.777.888S

DNI:46.777.888

NIF: 46.777.888 S

To prevent these format variations from making the template assignment fail, use a regular expression such as

((?:NIF|DNI):\s*)(\d{2}\.\d{3}\.\d{3})

and indicate that the ID value is in group 2 (which corresponds to 46.777.888).

If you are not used to regular expressions, there are many free online tools that can help you, for example https://regex101.com.
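
A candidate expression can also be checked quickly with a short script before it is typed into the template. The pattern below is an assumption written for this illustration (an NIF or DNI prefix, optional spaces, then the number captured in group 2); it is tested against the three formats listed above.

```python
import re

# Assumed pattern: group 1 is the prefix, group 2 is the supplier number.
SUPPLIER_ID = re.compile(r"((?:NIF|DNI):\s*)(\d{2}\.\d{3}\.\d{3})")

samples = ["NIF:46.777.888S", "DNI:46.777.888", "NIF: 46.777.888 S"]
ids = [SUPPLIER_ID.search(sample).group(2) for sample in samples]
print(ids)  # ['46.777.888', '46.777.888', '46.777.888']
```

All three variants yield the same group 2, which is the value that would be compared with the Document Id.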

 

2.2 Formats

 

2.2.1 Require specific formats

The FORMAT form allows you to create specific numeric and date formats that will be applied to fields of your template during data extraction.

  • Access Management of templates/Formats
  • Format: identification code.
  • Name: free text description.
  • Base*: format type, that is, Number or Date.
  • Search pattern: number format.
  • Format Date: date format. Use the standard notation for day (d), month (M) and year (y).

Examples:

dd-MM-yy corresponds to 01-12-19

dd MMM yyyy corresponds to 01 Dec 2019
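
To try out a date pattern before saving it, the letters can be translated into Python `strftime` directives. This is a minimal sketch that assumes only the small subset of letters used in the examples (dd, MM, MMM, yy, yyyy); the helper `to_strftime` is hypothetical and the abbreviated month name depends on the locale.

```python
from datetime import date

# Assumed subset of pattern letters, mapped to Python strftime directives.
# Longer tokens come first so that "MMM" is not read as "MM" followed by "M".
TOKENS = [("yyyy", "%Y"), ("MMM", "%b"), ("yy", "%y"), ("MM", "%m"), ("dd", "%d")]


def to_strftime(pattern: str) -> str:
    """Translate a d/M/y style pattern into a strftime format string."""
    out, i = "", 0
    while i < len(pattern):
        for token, directive in TOKENS:
            if pattern.startswith(token, i):
                out += directive
                i += len(token)
                break
        else:
            out += pattern[i]      # separators are copied unchanged
            i += 1
    return out


sample = date(2019, 12, 1)
print(sample.strftime(to_strftime("dd-MM-yy")))     # 01-12-19
print(sample.strftime(to_strftime("dd MMM yyyy")))  # 01 Dec 2019 (English month names)
```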

Once the format has been defined, it is necessary to access the template and assign desired format to the required field:

  • Access Management of templates/Templates through the menu.
  • Look for your template in query screen.
  • Open the Fields tab to access the field options.
  • Select the field and insert the identification code of the format in format field.
 

2.2.2 Data replacement

Review pending

This form also has another useful feature: data extracted from the document can be automatically replaced by another expression. This way you can modify the data before it reaches the intermediate/destination table (prior to transfer).

  • Select the desired field.
  • In Replace source, indicate the format of the field in the document (use regular expressions).
  • In Replace target, indicate the new format in which the extracted data will be transferred to the database.

Example: we want to remove unwanted spaces between the characters.

Replace source: \s

Replace target: (leave empty)
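
The effect of this source/target pair can be reproduced with a one-line regular expression substitution. The function name `apply_replacement` and the sample value are hypothetical and serve only to show the result.

```python
import re


def apply_replacement(value: str, source: str, target: str) -> str:
    """Replace every match of the `source` regex in `value` with `target`."""
    return re.sub(source, target, value)


# Source \s with an empty target removes all whitespace characters.
print(apply_replacement("46. 777 .888 S", r"\s", ""))  # 46.777.888S
```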

 

2.3 Checking correct functioning of the template

Correct partition of files into documents and the efficiency of the template are the most critical elements of the Axional OCR data extraction process. Templates will probably need to be tested several times before all possible cases in the documents to be processed are covered.

To verify that the template works correctly, access the Documents form.

  • Access Data extraction/Management of templates/Documents through the menu or via the file name in the Repository OCR files form.

TO DO

This section is incomplete and will be concluded as soon as possible.
   

3 Internal integration

Each type of document is assigned to an Integration Configuration, which controls the data transfer process.

  • Access Internal Integration/Configuration form through the menu.
  • Fill in Integration Configuration block fields:
    • Code*: unique name in uppercase.
    • Description*: free description text.
    • Head table*: intermediate/destination table to which extracted data will be transferred.
    • Serial column head*: name of column in head table that identifies each transferred row.
  • Sentence SQL/SXL/JS FOR COMPLETE DATA: allows you to create statements that insert, delete or modify data in the intermediate/destination table after the transfer process has been performed. This way the table data can be modified before the transfer to the final database is executed.

    In the example below, when the PROVIDER CIF value equals 33AAFCD6883Q12X, it is replaced by "Darshita Aashiyana Private". Otherwise, it is replaced by "Other Provider".
    <xsql-script name='f_integration'>
        <body>
            <select prefix='m_'>
                <columns>provider_cif</columns>
                <from table='ocr_test_form_s'/>
                <where>
                    cabid = ${integrationid}
                </where>
            </select>
        
            <set name='replaceValue'><null/></set>
            
            <if>
                <expr><eq><m_provider_cif/>33AAFCD6883Q12X</eq></expr>
                <then>
                    <set name='replaceValue'>Darshita Aashiyana Private</set>
                </then>
                <else>
                    <set name='replaceValue'>Other Provider</set>
                </else>
            </if>
            
            <update table='ocr_test_form_s'>
                <column name='provider_cif'><replaceValue/></column>
                <where>
                    cabid = ${integrationid}
                </where>
            </update>
    
        </body>
    </xsql-script>
  • Column mapping header: maps columns of the origin table (the table prior to transfer) to the intermediate/destination table.

Now the documents are ready to be transferred. After the transfer process, each row of the origin table is marked as transferred so that it is not transferred again in the next transfer process.
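
For readers who want to try the logic of the completion statement outside the product, the XSQL example above can be mirrored with Python and an in-memory SQLite database. This is only a sketch of the same select, compare and update steps: the table and column names are taken from the example, while the function `complete_data`, the second sample row and the use of SQLite are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ocr_test_form_s (cabid INTEGER PRIMARY KEY, provider_cif TEXT)")
conn.executemany("INSERT INTO ocr_test_form_s VALUES (?, ?)",
                 [(1, "33AAFCD6883Q12X"), (2, "99ZZZZZ0000Z0Z0")])


def complete_data(conn: sqlite3.Connection, integration_id: int) -> None:
    """Rewrite provider_cif for one transferred row, like the XSQL example."""
    row = conn.execute(
        "SELECT provider_cif FROM ocr_test_form_s WHERE cabid = ?",
        (integration_id,)).fetchone()
    if row is None:
        return                                # nothing was transferred with this id
    replace_value = ("Darshita Aashiyana Private"
                     if row[0] == "33AAFCD6883Q12X" else "Other Provider")
    conn.execute(
        "UPDATE ocr_test_form_s SET provider_cif = ? WHERE cabid = ?",
        (replace_value, integration_id))


for cabid in (1, 2):
    complete_data(conn, cabid)
print(conn.execute("SELECT * FROM ocr_test_form_s ORDER BY cabid").fetchall())
# [(1, 'Darshita Aashiyana Private'), (2, 'Other Provider')]
```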