1 External integration

This section explains how to configure the system so that the Axional OCR application works for your specific needs.

1.1 Creating a Type of document

We are going to define a new type of document that we want the system to process and integrate into our database. This means assigning a name and selecting the data fields that we want the template to look for.

Some of these fields are mandatory; others hold extra information that may be useful but is not essential. Later we will use the Templates form to define where the system can find these fields in this specific type of document.

  • Access Data Extraction/Management of templates/Type of Documents.
  • Press Execute/Run icon without specifying any field.
  • Reset form data with Clear button.
  • Define code and name in capital letters.
  • Press INSERT icon to save the type of document.

  • Code*: unique name in uppercase
  • Name*: description text in uppercase
  • Integration: name of the Integration Configuration assigned to this type of document. IMPORTANT: this field is mandatory to perform transfer processes, but the system allows leaving it empty for practical reasons. See Internal integration for further information.

Now that Document Type has been saved, the option FIELDS OF DOCUMENT is available:

  • Define the fields that the system will look for and assign if they are Required (mandatory fields) or not. Use the row insert option to save them.
 

1.2 Load settings

The system uploads the type of document that we have previously defined according to certain criteria, for example, whether the document type always has a single page or may have several.

Imagine that we have defined that the document has only one page: when the PDF file is uploaded, the system will create as many independent documents as the file has pages.

Access External integration/Load settings and create a new setting. Click on Clear button and fill in the following fields:

Load settings:
  • Source code*: name of the load setting; unique and uppercase.
  • Description of origin: free text description.
  • Folder file entry: folder on the server from which the system loads the documents to process. As soon as the files are processed, they are automatically moved to the Processed folder, which makes it easy to see which files have already been processed.
  • Processed folder*: folder on server where the system will move the processed documents.
  • Status*: active or disabled.
Generation:
  • Type of document*: code of documents that correspond to these settings.
  • Type of splitter*: instructs the system on how to divide the files to generate individual documents in order to apply corresponding templates. This is useful because files to be processed can have more than one document (for instance, more than one invoice). Possible options are:
    • Do not split.
    • White page: there is a white page between documents.
    • At the mark: instructs the system on how to divide files if there is a mark on them. This mark has to be an identifying mark, such as a company unique text, a readable barcode or a readable stamp. When this option is selected it is necessary to define Mark partition and Pattern to apply.
    • Each page.
    • Each 2/3/4 pages.
  • Only when At the mark option has been previously selected:
    • Mark partition*: indicates where this mark is located within the document. If the mark is located on the first page, it indicates the beginning of a new document. Conversely, if the mark is located on the last page, it indicates the end of the document.
    • Pattern to apply: text or regular expression to identify split point.

Finally, press the INSERT icon to save the load setting.
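
The splitter options described above can be illustrated with a minimal Python sketch. This is not Axional OCR code: the function name split_pages, the mode names and the representation of each page as a plain text string are assumptions made only for this example.

```python
import re
from typing import List, Optional


def split_pages(pages: List[str], mode: str, every: int = 1,
                mark_position: str = "first",
                pattern: Optional[str] = None) -> List[List[str]]:
    """Group page texts into documents (hypothetical model of the splitter types)."""
    if mode == "none":            # Do not split: the whole file is one document
        return [pages] if pages else []
    if mode == "each":            # Each page / each 2, 3, 4 pages
        return [pages[i:i + every] for i in range(0, len(pages), every)]

    docs: List[List[str]] = []
    current: List[str] = []
    if mode == "white":           # White page: an empty page separates documents
        for page in pages:
            if page.strip() == "":
                if current:
                    docs.append(current)
                current = []
            else:
                current.append(page)
    elif mode == "mark":          # At the mark: split where the pattern is found
        if pattern is None:
            raise ValueError("a pattern is required for the 'mark' mode")
        regex = re.compile(pattern)
        for page in pages:
            has_mark = regex.search(page) is not None
            # Mark on the first page: the mark opens a new document
            if mark_position == "first" and has_mark and current:
                docs.append(current)
                current = []
            current.append(page)
            # Mark on the last page: the mark closes the current document
            if mark_position == "last" and has_mark:
                docs.append(current)
                current = []
    else:
        raise ValueError(f"unknown splitter mode: {mode}")
    if current:
        docs.append(current)
    return docs


pages = ["INVOICE A", "detail", "INVOICE B", "detail 2"]
print(split_pages(pages, "mark", mark_position="first", pattern=r"INVOICE"))
```

With the sample pages, the mark "INVOICE" on the first page of each invoice produces two documents of two pages each.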

2 Data extraction

To extract data, Axional OCR relies on extraction templates, usually one for each type of document. Creating a template that covers all possible cases of data extraction is the key to the proper functioning of the process.

2.1 Creating a new template

To create a new template, follow these steps:

  1. Load a prototype document
  2. Process a prototype document
  3. Save a template
  4. Edit and validate the template

2.1.1 Loading a prototype document to create a Template

To create a new template it is necessary to load a valid document as a prototype.

  • Access the External Integration/Repository of OCR Documents.
  • Reset form data with Clear button.
  • Insert your load code in the Load code field, previously created on External integration/Load settings.
  • Insert a new Source name on Source field.
  • On the tab File data, load prototype file on CHOOSE FILE button. This is a mandatory step.
  • Press the INSERT icon: the message "Document is not processed" will appear on the document edit screen. The STATUS is now NEW.

Every time a file is uploaded, the system assigns an ID BATCH number.

In Repository, batches (uploaded documents) can have four states:

  • New: the file has been uploaded but not split.
  • Processed: the file has been split into documents using the load settings.
  • Canceled: split has been canceled by user.
  • Finished: after performing transfer process (some batch document may have been canceled before the transfer process).

2.1.2 Processing the prototype document

  • Press Process button at the screen bottom: the system will process the document. The STATUS now is PROCESSED and file is on server.
  • Press Process documents button at the screen bottom: the system will process the document. The STATUS now is PROCESSED and data is accessible.
  • PROTOTYPE REVIEW. Sometimes the prototype has more than one page or more than one document and needs some split or deletion modifications. Use the buttons on the right to make these changes:

    Joins current page with the previous one in a unique document.
    Joins current page with the next one in a unique document.
    Adds a new document partition between pages.
    Removes/restores page from template. Removed pages can be seen on the Repository OCR, but not on the final documents.
    Marks page as an attachment.

During this process all steps can be undone with the Cancel button.

2.1.3 Saving the new template

The document is now ready to become a template.

  • Press edit icon in order to create the template, since there isn't any template that fits this type of document (only use Search for a template if it has been previously created).
  • A new window shows two options: enter the name of an existing template to update it or, in our case, ASSIGN NEW TEMPLATE. Then enter the name and description of the new template.

The new template has been created, but the system describes it as incomplete because no data fields have been located on template. See next chapter in order to EDIT TEMPLATE.

During this process all steps can be undone with the Cancel button.

2.1.4 Editing an existing template

Although we have created our template from a prototype document, it is now necessary to edit and configure it taking into account the data that we are interested in labeling and extracting, since everything has to be carefully controlled.

The form EDIT TEMPLATE is the template wizard that will lead us through a series of well-defined steps to configure our template. This is the recommended way to edit templates, although you can also use the form TEMPLATE, which is recommended only for experts or to make small modifications.

There are two ways to access the form EDIT TEMPLATE:

  1. Access Data Extraction/Management Templates/Template edit in the tab TEMPLATE CONFIGURATION. (Prerequisites: we have uploaded and processed the prototype document for creating this template).
  2. After processing prototype document, clicking on the recent template edit name (see previous chapter).

We have to instruct the system to find in the document the fields that appear on the left side (bear in mind that these fields have been previously defined in the form TYPE OF DOCUMENT). IMPORTANT: it is not mandatory to define all fields, but at least those marked as required must be defined.

  • Select one field (grey indicates that it has not been defined yet): press the icon to add a new search expression.
  • Now the mouse cursor has transformed into a mouse pointer. Select the area where you can find LABEL and VALUE of the desired FIELD.
  • A new window shows the codes found by pointer: drag and drop each code to the corresponding label or value area (inside each area the order is not important)
  • IMPORTANT: Check Use rectangular selection box if the system could find the same label in another area of the document and return an undesired result.
  • Press ADD field when selection has finished.

The template wizard has automatically created an extraction form from our selection, which can now be modified. Note that when a field is selected, the label area turns grey and the value area yellow.

  • Select and edit the previous field by clicking on the edit icon: a new window will appear.

The window shows settings that can be easily modified to adjust for possible variations of the label expression or of the value position and dimension.

  • Selection zone X initial: in relation to USE RECTANGULAR SELECTION BOX.
  • Selection zone X end: in relation to USE RECTANGULAR SELECTION BOX.
  • Selection zone Y initial: in relation to USE RECTANGULAR SELECTION BOX.
  • Selection zone Y end: in relation to USE RECTANGULAR SELECTION BOX.
  • Regular Expression: the system has automatically created the expression of selected label.
  • Offset rows: distance in rows of the value from label.
  • Offset cols: distance in columns of the value from label.
  • Num rows: number of rows occupied by label.
  • Num cols: number of columns occupied by label.
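
How a label expression and the offset settings could combine to locate a value can be sketched in a few lines of Python. This is only an illustration: the function extract_value is hypothetical, the OCR text is modelled as a list of fixed-width text rows, offsets are assumed to be measured from the start of the label, and Num rows and Num cols are assumed to give the size of the area read for the value (as Case 2 in the examples suggests).

```python
import re
from typing import List, Optional


def extract_value(lines: List[str], label_regex: str, offset_rows: int,
                  offset_cols: int, num_rows: int, num_cols: int) -> Optional[str]:
    """Find the label, then read a block of text located relative to it."""
    for row, line in enumerate(lines):
        match = re.search(label_regex, line)
        if match is None:
            continue
        top = max(0, row + offset_rows)             # first row of the value area
        left = max(0, match.start() + offset_cols)  # first column of the value area
        parts = [text[left:left + num_cols].strip()
                 for text in lines[top:top + num_rows]]
        return " ".join(part for part in parts if part)
    return None  # label not found: the field would remain in error


ocr_text = [
    "Invoice No:   12345",
    "Customer:",
    "  ACME Corp",
]
# Label and value on the same row
print(extract_value(ocr_text, r"Invoice No:", 0, 14, 1, 10))  # 12345
# Value one row below the label, with a wider area for longer names
print(extract_value(ocr_text, r"Customer:", 1, 0, 1, 47))     # ACME Corp
```

Widening num_cols, as in the second call, is harmless when the rest of the row is empty, which is why a generous value area is a reasonable choice for variable-length values.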

Tab OCR Text

During template editing it can be very useful to see what data the system detects and where it detects it. From the form EDIT TEMPLATE you can access the tab OCR TEXT, which shows exactly how the application finds the information.

Some useful examples

The previous example is the simplest case, because label and value are on the same row and both always have the same length and occupy the same space. Here are some more cases.

Case 1 Label and value are on different rows and with fixed length

The template is divided into imaginary rows and columns, which is useful when the label and the value are located on different rows. In this case, the template wizard has automatically detected the value area, and there is no risk that the length of the value will vary. No modification is needed.

Case 2 Label and value are on different rows and occupy variable length

In this case, the template wizard has automatically detected the value area, but we will increase this area to prevent longer names from being extracted incorrectly. For that reason we have increased the number of columns up to 47.

Case 3 Label is repeated somewhere in the text

To avoid incorrect readings, the extraction is done only in the exact area of the sheet. In this case, Use rectangular selection has been checked during template editing.
This is only useful when the system could find the same label in another area of the document and return an undesired result.

Case 4 Optional labels to identify one field

It is possible to use more than one label to identify a single field. This is useful, for instance, when documents may arrive in two languages. In this case the invoice can be loaded in either language, as all field labels have been defined in both languages.

Automatically assign templates to documents: regular expressions

Now that the new template has been created, we can add regular expressions that automatically identify different documents, so that the appropriate template is assigned after the loaded files are split.

Regular expressions provide a very flexible way to search or recognize text strings within the document.

For example, use your supplier's ID, NIF or CIF to easily identify their invoices.

You can use more than one expression to identify the document: the system tries the first expression on each document and stops as soon as a match is found. If no match is found, the system tries the second expression, and so on. If nothing matches, no template is assigned and the document cannot be validated.

  • Access Document templates through the Tab Templates or the Menu.
  • Fill in one or more fields from the block EXPRESSION TO IDENTIFY THE TEMPLATE:
    • Expression to extract the ID: regular expressions to share between templates, used to optimize performance. For this reason, it's called "by group". The Group of extraction ID field indicates which block/group of the regular expression contains the ID.
      Then the system will check whether this text matches the identification declared in the Document Id field.
    • Index of expression: the identifier is a simple text to find in the document data. For example: the name of a company.
    • Regexp expression: the identifier is a single regular expression with no group. Use a regular expression instead of Index of expression when the ID includes variable parts.

The example above uses a regular expression to identify one supplier by means of the CIF number.

If you are not used to regular expressions, there are multiple free online applications that will help you. As an example you can use https://regex101.com.
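
The identification logic described in this section can be sketched in Python as follows. This is an illustration only, not the Axional OCR implementation: the function names, the rule tuples, the template names and the sample CIF value are hypothetical, and the CIF pattern is a simplified shape (one letter, seven digits, one control character).

```python
import re
from typing import Dict, List, Optional, Tuple

# (template name, kind, expression); kind is "index" (plain text) or "regexp"
Rule = Tuple[str, str, str]


def find_template(text: str, rules: List[Rule]) -> Optional[str]:
    """Try each expression in order and stop at the first one that matches."""
    for template, kind, expression in rules:
        if kind == "index":
            found = expression in text                       # simple text search
        else:
            found = re.search(expression, text) is not None  # regular expression
        if found:
            return template
    return None  # no match: no template is assigned


def find_template_by_group(text: str, shared_expression: str, group: int,
                           document_ids: Dict[str, str]) -> Optional[str]:
    """Run one shared expression, then compare the captured ID with each Document Id."""
    match = re.search(shared_expression, text)
    if match is None:
        return None
    return document_ids.get(match.group(group))


text = "Supplier ACME S.L. CIF: B12345678 Invoice 2019/001"
rules = [
    ("TPL_GLOBEX", "index", "Globex"),
    ("TPL_ACME", "regexp", r"CIF:\s*B12345678"),
]
print(find_template(text, rules))
print(find_template_by_group(text, r"CIF:\s*([A-Z]\d{7}[0-9A-J])", 1,
                             {"B12345678": "TPL_ACME"}))
```

The "by group" variant evaluates a single expression per document regardless of how many templates share it, which is the performance benefit mentioned above.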

2.2 Formats

2.2.1 Extraction with specific format

The form FORMAT allows creating specific numeric and date formats that will be applied in fields of your template during control data extraction.

  • Access Management of templates/Formats
  • Format: identification code.
  • Name: free text description.
  • Base*: format type, that is, Number or Date.
  • Search pattern: number format.
  • Format Date: date format. Use standard nomenclature of day(d), month(M) and year(y).

Examples:

dd-MM-yy corresponds to 01-12-19

dd MMM yyyy corresponds to 01 Dec 2019
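
To check what a date pattern accepts, the d/M/y nomenclature can be translated into Python strptime directives. The sketch below is an illustration under assumptions: the helper names are hypothetical, only the tokens dd, MM, MMM, yy and yyyy are handled, and month abbreviations such as "Dec" depend on the active locale (English is assumed here).

```python
import re
from datetime import date, datetime

# Mapping from the d/M/y tokens to Python strptime directives (assumed subset)
_DIRECTIVES = {"yyyy": "%Y", "yy": "%y", "MMM": "%b", "MM": "%m", "dd": "%d"}
# Longer tokens come first so that "yyyy" is not read as two "yy"
_TOKEN = re.compile(r"yyyy|yy|MMM|MM|dd")


def to_strptime(pattern: str) -> str:
    """Translate a pattern such as dd-MM-yy into %d-%m-%y."""
    return _TOKEN.sub(lambda m: _DIRECTIVES[m.group(0)], pattern)


def parse_date(text: str, pattern: str) -> date:
    """Parse text with a d/M/y pattern; raises ValueError if it does not fit."""
    return datetime.strptime(text, to_strptime(pattern)).date()


print(parse_date("01-12-19", "dd-MM-yy"))        # 2019-12-01
print(parse_date("01 Dec 2019", "dd MMM yyyy"))  # 2019-12-01
```

A value that does not fit the pattern raises ValueError, which roughly corresponds to a field whose extraction fails because of a wrong format.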

Once the format has been defined, it is necessary to access the template and assign desired format to the required field:

  • Access Management of templates/Templates through the menu.
  • Look for your template in query screen.
  • Open the Fields tab to see the field options.
  • Select the field and insert the identification code of the format in the field Format.

2.2.2 Data replacement

The FIELDS tab has another useful feature: data extracted from the document can be automatically replaced by another expression. This way you can modify the data before it reaches the destination table (before the transfer process).

Expression to be searched

Procedure:

  • Select desired field.
  • In the Replace source field, indicate the format of the field in the document (use regular expressions).
  • In the Replace target field, indicate the new format for the extracted data to be transferred to the database.

For example, to remove unwanted spaces between characters, use these expressions:

  • Replace source: \s
  • Replace target: (leave empty)
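
The effect of a Replace source / Replace target pair can be previewed with Python's re.sub before it is entered in the form. This is a sketch under assumptions: the function apply_replacement is hypothetical, and the regular expression dialect used by Axional OCR may differ from Python's in some details, in particular in how groups are referenced in the target.

```python
import re


def apply_replacement(value: str, replace_source: str, replace_target: str = "") -> str:
    """Replace every match of replace_source in value with replace_target."""
    return re.sub(replace_source, replace_target, value)


# Removing whitespace, as in the example above (empty target)
print(apply_replacement("ES 1234 5678", r"\s"))              # ES12345678
# Turning a decimal comma into a decimal point using groups
print(apply_replacement("12,50", r"(\d+),(\d+)", r"\1.\2"))  # 12.50
```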

Dictionary of replacement of values

In the tab Dictionary of replacement of values, the value of the field must match the indicated value exactly in order to be replaced by a specific new one. Regular expressions cannot be used.

  • Select desired field.
  • Indicate a priority (or keep the default).
  • Indicate Original value.
  • Indicate Replacement value.
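
The difference from the regular expression replacement is that dictionary entries only apply on an exact match. A minimal Python sketch of that behaviour follows; the function name, the entry layout and the rule that entries with a lower priority number are checked first are assumptions made for this example.

```python
from typing import List, Tuple

# (priority, original value, replacement value)
Entry = Tuple[int, str, str]


def replace_from_dictionary(value: str, entries: List[Entry]) -> str:
    """Return the replacement of the first entry whose original value equals value."""
    for _, original, replacement in sorted(entries, key=lambda entry: entry[0]):
        if value == original:  # exact comparison, no regular expressions
            return replacement
    return value               # no entry matches: the value is left unchanged


entries = [(1, "EUR", "Euro"), (2, "USD", "US Dollar")]
print(replace_from_dictionary("USD", entries))  # US Dollar
print(replace_from_dictionary("usd", entries))  # usd (case differs, so no match)
```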

2.3 Checking the correct functioning of the template

The correct file partition to detect documents and the efficiency of the template are the most critical elements of the Axional OCR data extraction process. It is likely that it will be necessary to test templates several times before finding all possible cases in the documents to be processed.

  • STEP 1: load again the prototype file through the form External Integration/Repository OCR files.
  • STEP 2: check if the splitter worked correctly.
    If NOT, review and modify Load Settings and try to load it again (see section Load settings for further information).
  • STEP 3: check the document status through the Document extraction/Management of templates/Documents.
    • New: no template has been assigned and consequently the data extraction process has not yet taken place. SOLUTION: review and modify regular expression or text that automatically identifies templates. When finished, try the Search for a template option in the form Document, so the system will try to find and apply the adequate template.
    • Incomplete: some fields show the Error message because data extraction was not possible. SOLUTION: check these fields through the form Repository of OCR. Review and modify template. When done, try the Apply template option in the form Document, so the system will try the modified template.

      REMEMBER

      If template has been modified in order to solve errors or make improvements, it is necessary to press Apply template button to repeat the data extraction process.
    • Canceled: the user has cancelled the previous process. SOLUTION: check the different processes that the system offers you at the bottom of the screen.
    • Validated: all fields show the Validated message because data extraction was correct. The document is ready to be transferred.
    • Transferred: document data has been correctly transferred to the destination table.
 

3 Internal integration

Each type of document is assigned to an Integration Configuration, which controls the data transfer process.

3.1 Defining Internal Configuration

  • Access Internal Integration/Configuration form through the menu:
  • Fill in Integration Configuration block fields:
    • Code*: unique name in uppercase.
    • Description*: free description text.
    • Head table*: destination table where extracted data will be transferred.
    • Serial column head*: name of the column in the head table that identifies each transferred row.
  • Sentence SQL/SXL/JS FOR COMPLETE DATA: allows creating statements that modify the data in the destination table after the transfer process has been performed. This way the table data can be modified before it is transferred to the final database.

    Two different expressions can be used to identify the document:
    • ${integrationid}
    • ${docid}

    In the example below, when PROVIDER CIF value corresponds to value 33AAFCD6883Q12X, then it will be replaced by value "Darshita Aashiyana Private". If PROVIDER CIF value doesn't correspond to that value, then it will be replaced by "Other Provider".
    <xsql-script name='f_integration'>
        <body>
            <select prefix='m_'>
                <columns>provider_cif</columns>
                <from table='ocr_test_form_s'/>
                <where>
                    cabid = ${integrationid}
                </where>
            </select>
        
            <set name='replaceValue'><null/></set>
            
            <if>
                <expr><eq><m_provider_cif/>33AAFCD6883Q12X</eq></expr>
                <then>
                    <set name='replaceValue'>Darshita Aashiyana Private</set>
                </then>
                <else>
                    <set name='replaceValue'>Other Provider</set>
                </else>
            </if>
            
            <update table='ocr_test_form_s'>
                <column name='provider_cif'><replaceValue/></column>
                <where>
                    cabid = ${integrationid}
                </where>
            </update>
    
        </body>
    </xsql-script>
  • Column mapping header: relates columns of the origin table (the table used before the transfer) to columns of the destination table.

3.2 Applying Internal Configuration to document

  • Access and select the Type of Document and select one configuration in the Integration field.
  • Save modifications on Type of Document.

3.3 Transferring Data

Now documents are ready to be transferred.

  • Access Data Extraction/Management of templates/Documents form and select the document.
  • Press Transfer option at the bottom of the screen.

After the transfer process, each row of the origin table is marked as transferred so that it is not transferred again in the next transfer process.

3.3.1 Checking the correct functioning of the transfer process

  • OPTION 1: check that the Document status is Transferred through the form Documents.
  • OPTION 2: check that the extracted data is in the destination table. If you have Axional DBStudio, open the destination table and check it.
  • OPTION 3: check that Batch status is Finished.

If one of the previous options is NOT OK, review and modify the INTEGRATION CONFIGURATION.

IMPORTANT

If the INTEGRATION CONFIGURATION has been modified in order to solve errors or make improvements, it is necessary to press the Transfer button to repeat the transfer process.