1 OCRDocumentText

You can load the text of a PDF document into a JavaScript OCRDocumentText object from any object that can be converted to an InputStream (File, Blob, URL, base64 string).

<script>
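    // Read the sample PDF from a URL (PDFReader is the class name used in the other examples on this page).
    var pdf = new Ax.pdf.PDFReader(new Ax.net.URL('https://bitbucket.org/deister/axional-docs-resources/raw/master/PDF/sample.pdf'));
    // Wrap the text of page 2 in an OCRDocumentText object.
    var ocrText = new Ax.ocr.OCRDocumentText(pdf.getTextFromPage(2));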
</script>

OCRDocumentText gives you an object that encapsulates contextual analysis methods for extracting data, either through search patterns or through text positions.

Return   Method                  Description
boolean  findText                Returns true if the search string is found.
integer  getMatchRow             Returns the row of the text where the string matched.
integer  getMatchCol             Returns the column of the text where the string matched.
integer  getMatchStart           Returns the start index in the text where the string matched.
integer  getMatchEnd             Returns the end index in the text where the string matched.
integer  getMatchWidth           Returns the number of columns matched.
integer  getMatchGroupCount      Returns the number of groups matched by the findText function.
String   getMatchGroup           Returns the text matched by a specific group. If no group is given, it returns the match of the whole regular expression.
integer  getMatchGroupStart      Returns the start index in the text where the specific group was matched by the findText function.
integer  getMatchGroupEnd        Returns the end index in the text where the specific group was matched by the findText function.
integer  getNumberOfRows         Returns the number of rows in the OCRDocumentText.
String   getLineAt               Returns a specific row.
String   getTextRect             Returns the rectangular selection delimited by its parameters (row, number of rows, col, number of cols).
String   generateMultigroupExpr  Returns a regular expression built from the words you want to use in the expression.

2 Find text

The findText method returns true if the searched text is found and false otherwise.

The search string can be a regular expression.

<script>
    var pdf = new Ax.pdf.PDFReader(new Ax.net.URL('https://bitbucket.org/deister/axional-docs-resources/raw/master/PDF/sample.pdf')); 
    var ocrText = new Ax.ocr.OCRDocumentText(pdf.getTextFromPage(2));
    // Prints true when the text is found on the page, false otherwise.
    console.log(ocrText.findText("how boring typing this stuff"));
</script>
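
To make the regular-expression behaviour concrete, here is a minimal sketch in plain JavaScript. It is not the Axional implementation: it only assumes that findText behaves like a regular-expression search over the page text. The `sampleText` string and the `findTextSketch` helper are hypothetical names introduced for illustration.

```javascript
// Hypothetical page text; a real page would come from pdf.getTextFromPage().
var sampleText = [
    "A Simple PDF File",
    "This is a small demonstration. Boring, zzzzz.",
    "Oh, how boring typing this stuff."
].join("\n");

// Sketch only: treat the search string as a regular expression and
// report whether it occurs anywhere in the text.
function findTextSketch(text, search) {
    return new RegExp(search).test(text);
}

console.log(findTextSketch(sampleText, "how boring typing this stuff")); // true
console.log(findTextSketch(sampleText, "typing\\s+this"));                // true
console.log(findTextSketch(sampleText, "no such text"));                  // false
```

A plain string with no special characters is itself a valid regular expression, which is why the literal search and the pattern search go through the same call.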

3 Getting information about the matched search value

  • The getMatchRow method returns the row of the text where the search value was matched by the findText function.
  • The getMatchCol method returns the column of the text where the search value was matched by the findText function.
  • The getMatchStart method returns the start index in the text where the search value was matched by the findText function.
  • The getMatchEnd method returns the end index in the text where the search value was matched by the findText function.
  • The getMatchWidth method returns the number of columns matched.

The findText function must have returned true before any of the functions above can be used.

<script>
    var pdf = new Ax.pdf.PDFReader(new Ax.net.URL('https://bitbucket.org/deister/axional-docs-resources/raw/master/PDF/sample.pdf')); 
    var ocrText = new Ax.ocr.OCRDocumentText(pdf.getTextFromPage(2));
    if(ocrText.findText("how boring typing this stuff")) {
        console.log("ROW MATCH          :" + ocrText.getMatchRow());
        console.log("COL MATCH          :" + ocrText.getMatchCol());
        console.log("START INDEX MATCH  :" + ocrText.getMatchStart());
        console.log("END INDEX MATCH    :" + ocrText.getMatchEnd());
        console.log("MATCH WIDTH        :" + ocrText.getMatchWidth());
    }
</script>
ROW MATCH          :6
COL MATCH          :24
START INDEX MATCH  :226
END INDEX MATCH    :254
MATCH WIDTH        :28
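
In the output above, the width equals the end index minus the start index (254 - 226 = 28), which is also the length of the search string. The following plain JavaScript sketch shows how such values can relate to one another. It is an illustration under assumptions, not the library's code: the `locate` helper and the sample `lines` are hypothetical, and rows, columns and indices are counted from 0 here, which may differ from what OCRDocumentText does.

```javascript
// Hypothetical three-line text used only for this sketch.
var lines = [
    "first line",
    "second line with some text",
    "oh, how boring typing this stuff."
];
var text = lines.join("\n");

// Locate a match and derive the positional values from its absolute index.
function locate(text, search) {
    var m = new RegExp(search).exec(text);
    if (m === null) return null;
    // Everything before the match, split into lines.
    var before = text.slice(0, m.index).split("\n");
    return {
        row: before.length - 1,                 // line breaks before the match
        col: before[before.length - 1].length,  // characters since the last line break
        start: m.index,                         // absolute start index
        end: m.index + m[0].length,             // absolute end index (exclusive)
        width: m[0].length                      // number of characters matched
    };
}

var hit = locate(text, "how boring typing this stuff");
console.log(hit);
```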

4 Getting information about a matched group

The search string passed to findText can be a regular expression, so it can contain groups. To get information about a matched group, use the following functions:

  • The getMatchGroupCount method returns the number of groups matched by the findText function.
  • The getMatchGroup method returns the text matched by a specific group. If no group is given, it returns the match of the whole regular expression.
  • The getMatchGroupStart method returns the start index in the text where the specific group was matched by the findText function.
  • The getMatchGroupEnd method returns the end index in the text where the specific group was matched by the findText function.

The findText function must have returned true before any of the functions above can be used.

<script>
    var pdf = new Ax.pdf.PDFReader(new Ax.net.URL('https://bitbucket.org/deister/axional-docs-resources/raw/master/PDF/sample.pdf')); 
    var ocrText = new Ax.ocr.OCRDocumentText(pdf.getTextFromPage(2));
    if(ocrText.findText("(boring).*(this[ ]*stuff)")) {
        console.log("RETURN TEXT MATCHED BY THE REGULAR EXPRESSION  :" + ocrText.getMatchGroup());
        console.log("NUMBER OF GROUP MATCHED                        :" + ocrText.getMatchGroupCount());
        console.log("RETURN TEXT MATCHED ON THE FIRST GROUP         :" + ocrText.getMatchGroup(1));
        console.log("RETURN START INDEX OF THE FIRST GROUP          :" + ocrText.getMatchGroupStart(1));
        console.log("RETURN END INDEX OF THE FIRST GROUP            :" + ocrText.getMatchGroupEnd(1));
        console.log("RETURN TEXT MATCHED ON THE SECOND GROUP        :" + ocrText.getMatchGroup(2));
        console.log("RETURN START INDEX OF THE SECOND GROUP         :" + ocrText.getMatchGroupStart(2));
        console.log("RETURN END INDEX OF THE SECOND GROUP           :" + ocrText.getMatchGroupEnd(2));
    }
</script>
RETURN TEXT MATCHED BY THE REGULAR EXPRESSION  :boring typing this stuff
NUMBER OF GROUP MATCHED                        :2
RETURN TEXT MATCHED ON THE FIRST GROUP         :boring
RETURN START INDEX OF THE FIRST GROUP          :230
RETURN END INDEX OF THE FIRST GROUP            :236
RETURN TEXT MATCHED ON THE SECOND GROUP        :this stuff
RETURN START INDEX OF THE SECOND GROUP         :244
RETURN END INDEX OF THE SECOND GROUP           :254
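
For comparison, the same group information can be obtained with a standard JavaScript RegExp. This sketch does not use OCRDocumentText. It assumes a runtime that supports the `d` (hasIndices) flag, and the sample `text` is hypothetical, so the indices differ from the output above.

```javascript
var text = "Oh, how boring typing this stuff.";

// The "d" flag asks the engine to record start/end indices for each group.
var m = new RegExp("(boring).*(this[ ]*stuff)", "d").exec(text);

if (m !== null) {
    console.log("whole match :" + m[0]);
    console.log("group count :" + (m.length - 1));
    for (var g = 1; g < m.length; g++) {
        var span = m.indices[g]; // [start, end) of group g within text
        console.log("group " + g + " :" + m[g] + " [" + span[0] + ", " + span[1] + ")");
    }
}
```

As in the documented output, each group's end index minus its start index equals the length of the text that group matched.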

5 Get text by position

The OCRDocumentText class provides functions to extract text by position. They are the following:

  • The getNumberOfRows method returns the number of rows in the OCRDocumentText.
  • The getLineAt method returns a specific row.
  • The getTextRect method returns the rectangular selection delimited by its parameters (row, number of rows, col, number of cols).

These functions work on the loaded text directly; the example below calls them without a previous findText call.

<script>
    var pdf = new Ax.pdf.PDFReader(new Ax.net.URL('https://bitbucket.org/deister/axional-docs-resources/raw/master/PDF/sample.pdf')); 
    var ocrText = new Ax.ocr.OCRDocumentText(pdf.getTextFromPage(2));
    console.log("NUMBER OF ROWS     :" + ocrText.getNumberOfRows());
    console.log("LINE 4             :" + ocrText.getLineAt(4));
    console.log("GET TEXT RECT      :" + ocrText.getTextRect(4,2,20,40));
</script>
NUMBER OF ROWS     :9
LINE 4             :             ...continued from page 1. Yet more text. And more text. And more text.
GET TEXT RECT      :tinued from page 1. Yet more text. And m
re text. And more text. And more text. A
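
The rectangular selection can be pictured with a small plain JavaScript sketch. It is only an illustration of the idea behind the (row, number of rows, col, number of cols) parameters: the `textRectSketch` helper and the sample `lines` are hypothetical, and indices are 0-based here, which may not match the library.

```javascript
// Hypothetical page lines; padEnd keeps the values aligned at column 13.
var lines = [
    "Invoice".padEnd(13) + "2019-001",
    "Client".padEnd(13) + "ACME Corp",
    "Total".padEnd(13) + "1,250.00"
];

// Sketch only: take nRows lines starting at row and, from each of them,
// nCols characters starting at col.
function textRectSketch(lines, row, nRows, col, nCols) {
    return lines
        .slice(row, row + nRows)
        .map(function (line) { return line.slice(col, col + nCols); })
        .join("\n");
}

console.log("NUMBER OF ROWS :" + lines.length);
console.log("LINE 1         :" + lines[1]);
console.log("TEXT RECT      :\n" + textRectSketch(lines, 1, 2, 13, 9));
```

Lines shorter than the requested rectangle simply contribute fewer characters in this sketch.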

6 Create automatic regular expression

The OCRDocumentText class provides the generateMultigroupExpr function, which creates a regular expression from just the words you want to use in the expression.

<script>
    console.log("REGULAR EXPRESSION OF DATE:" + Ax.ocr.OCRDocumentText.generateMultigroupExpr(new Array("Date", "01/03/2019")));
    console.log("REGULAR EXPRESSION OF NUMBER:" + Ax.ocr.OCRDocumentText.generateMultigroupExpr(new Array("Invoice Number:", 321321321)));
    console.log("REGULAR EXPRESSION OF STRING:" + Ax.ocr.OCRDocumentText.generateMultigroupExpr(new Array("Client Name:", "CLIENT NAME")));
</script>
REGULAR EXPRESSION OF DATE:(?:\s*)?(Date)(?:\s*)?(\d+\/\d+\/\d+)?
REGULAR EXPRESSION OF NUMBER:(?:\s*)?(Invoice(?:\s*)?Number\:)(?:\s*)?(\d+)?
REGULAR EXPRESSION OF STRING:(?:\s*)?(Client(?:\s*)?Name\:)(?:\s*)?(\w+(?:\s*)?\w+)?
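
The generated expressions replace the sample values with general patterns such as \d+, so they also match other values of the same shape. The sketch below tests two of the expressions printed above with a standard JavaScript RegExp. The input strings are hypothetical, and the sketch assumes the expressions behave the same way in plain JavaScript as inside findText.

```javascript
// Expressions copied from the output above.
var dateExpr = /(?:\s*)?(Date)(?:\s*)?(\d+\/\d+\/\d+)?/;
var numberExpr = /(?:\s*)?(Invoice(?:\s*)?Number\:)(?:\s*)?(\d+)?/;

// Hypothetical lines of OCR text with values that differ from the samples.
var dateMatch = dateExpr.exec("Date   15/11/2020");
var numberMatch = numberExpr.exec("Invoice Number: 987654");

// Group 1 holds the label and group 2 holds the value.
console.log(dateMatch[1] + " -> " + dateMatch[2]);
console.log(numberMatch[1] + " -> " + numberMatch[2]);
```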