1 OCRDocumentText
You can load the text of a PDF document into a JavaScript OCRDocumentText object from any object that can be converted to an InputStream (File, Blob, URL, or base64 string).
<script>
    // Open the sample PDF from a URL and wrap the text of page 2.
    var pdf = new Ax.pdf.Reader(new Ax.net.URL('https://bitbucket.org/deister/axional-docs-resources/raw/master/PDF/sample.pdf'));
    var ocrText = new Ax.ocr.OCRDocumentText(pdf.getTextFromPage(2));
</script>
An OCRDocumentText object encapsulates the contextual analysis methods used to extract data, either through search patterns or through text positions.
| Return | Method | Description |
|---|---|---|
| boolean | findText | Returns true if the search string is found. |
| integer | getMatchRow | Returns the row of the text where the string is matched. |
| integer | getMatchCol | Returns the column of the text where the string is matched. |
| integer | getMatchStart | Returns the start index in the text where the string is matched. |
| integer | getMatchEnd | Returns the end index in the text where the string is matched. |
| integer | getMatchWidth | Returns the number of columns matched. |
| integer | getMatchGroupCount | Returns the number of groups matched by the findText function. |
| String | getMatchGroup | Returns the text matched by a specific group. If no group is given, returns the text matched by the whole regular expression. |
| integer | getMatchGroupStart | Returns the start index in the text where the specific group has been matched by the findText function. |
| integer | getMatchGroupEnd | Returns the end index in the text where the specific group has been matched by the findText function. |
| integer | getNumberOfRows | Returns the number of rows in the OCRDocumentText. |
| String | getLineAt | Returns a specific row. |
| String | getTextRect | Returns the text inside a delimited rectangular selection. Params (row, number of rows, col, number of cols). |
| String | generateMultigroupExpr | Returns a regular expression built from just the words you want to use in the expression. |
2 Find text
The findText method returns true if the searched text is found and false otherwise.
The search string can contain regular expressions.
<script>
    var pdf = new Ax.pdf.Reader(new Ax.net.URL('https://bitbucket.org/deister/axional-docs-resources/raw/master/PDF/sample.pdf'));
    var ocrText = new Ax.ocr.OCRDocumentText(pdf.getTextFromPage(2));
    console.log(ocrText.findText("how boring typing this stuff"));
</script>
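Because the search string is read as a regular expression, literal text that contains characters such as parentheses or dots may need escaping. The following is a minimal sketch in plain JavaScript, not part of the Ax.ocr API: escapeRegExp is a hypothetical helper, the sample line is invented, and the sketch assumes findText uses the usual regular expression metacharacters, which this page does not confirm.

```javascript
// Hypothetical helper (not an Ax.ocr function): escapes regular expression
// metacharacters so that a literal string is matched character by character.
function escapeRegExp(literal) {
    return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Invented sample line for illustration only.
const line = "Total (EUR): 1.250,00";

// Unescaped, the parentheses form a group, so the pattern looks for "Total EUR".
const rawFound = new RegExp("Total (EUR)").test(line);

// Escaped, the parentheses are matched literally.
const escapedFound = new RegExp(escapeRegExp("Total (EUR)")).test(line);

console.log("RAW PATTERN FOUND :" + rawFound);
console.log("ESCAPED PATTERN FOUND :" + escapedFound);
```

Under that assumption, the escaped string could be passed to findText in place of the raw literal.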
3 Getting information about the matched search value
- The getMatchRow method returns the row of the text where the search value has been matched by the findText function.
- The getMatchCol method returns the column of the text where the search value has been matched by the findText function.
- The getMatchStart method returns the start index in the text where the search value has been matched by the findText function.
- The getMatchEnd method returns the end index in the text where the search value has been matched by the findText function.
- The getMatchWidth method returns the number of columns matched.
The findText function must return true before any of the functions explained above can be used.
<script>
    var pdf = new Ax.pdf.Reader(new Ax.net.URL('https://bitbucket.org/deister/axional-docs-resources/raw/master/PDF/sample.pdf'));
    var ocrText = new Ax.ocr.OCRDocumentText(pdf.getTextFromPage(2));

    // Read the match details only after a successful search.
    if (ocrText.findText("how boring typing this stuff")) {
        console.log("ROW MATCH :" + ocrText.getMatchRow());
        console.log("COL MATCH :" + ocrText.getMatchCol());
        console.log("START INDEX MATCH :" + ocrText.getMatchStart());
        console.log("END INDEX MATCH :" + ocrText.getMatchEnd());
        console.log("MATCH WIDTH :" + ocrText.getMatchWidth());
    }
</script>
ROW MATCH :6
COL MATCH :24
START INDEX MATCH :226
END INDEX MATCH :254
MATCH WIDTH :28
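In the sample output above, the end index minus the start index equals the match width (254 - 226 = 28). The sketch below shows, in plain JavaScript, how row, column, start, end and width can relate for a match in multi-line text. locateMatch is a hypothetical helper, not an Ax.ocr function, and it counts rows and columns from 0, which is an assumption: this page does not state which base OCRDocumentText uses.

```javascript
// Hypothetical helper (not an Ax.ocr function): locates the first match of
// a pattern in multi-line text and derives its position values.
function locateMatch(text, pattern) {
    const match = new RegExp(pattern).exec(text);
    if (match === null) {
        return null;
    }
    const start = match.index;              // index of the first matched character
    const end = start + match[0].length;    // index just past the last matched character
    const before = text.slice(0, start);
    // Row: number of line breaks before the match (0-based, an assumption).
    const row = before.split("\n").length - 1;
    // Column: offset from the start of the line that contains the match.
    const col = start - (before.lastIndexOf("\n") + 1);
    return { row: row, col: col, start: start, end: end, width: end - start };
}

// Invented sample text for illustration only.
const sample = "first line\nsecond line with a target\nthird line";
const hit = locateMatch(sample, "target");
console.log(hit);
```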
4 Getting information about a matched group
The search string passed to the findText function can contain regular expressions, and therefore groups. To get information about a matched group, use the following functions:
- The getMatchGroupCount method returns the number of groups matched by the findText function.
- The getMatchGroup method returns the text matched by a specific group. If no group is given, it returns the text matched by the whole regular expression.
- The getMatchGroupStart method returns the start index in the text where the specific group has been matched by the findText function.
- The getMatchGroupEnd method returns the end index in the text where the specific group has been matched by the findText function.
The findText function must return true before any of the functions explained above can be used.
<script>
    var pdf = new Ax.pdf.Reader(new Ax.net.URL('https://bitbucket.org/deister/axional-docs-resources/raw/master/PDF/sample.pdf'));
    var ocrText = new Ax.ocr.OCRDocumentText(pdf.getTextFromPage(2));

    // The pattern defines two groups: "boring" and "this stuff".
    if (ocrText.findText("(boring).*(this[ ]*stuff)")) {
        console.log("RETURN TEXT MATCHED BY THE REGULAR EXPRESSION :" + ocrText.getMatchGroup());
        console.log("NUMBER OF GROUP MATCHED :" + ocrText.getMatchGroupCount());
        console.log("RETURN TEXT MATCHED ON THE FIRST GROUP :" + ocrText.getMatchGroup(1));
        console.log("RETURN START INDEX OF THE FIRST GROUP :" + ocrText.getMatchGroupStart(1));
        console.log("RETURN END INDEX OF THE FIRST GROUP :" + ocrText.getMatchGroupEnd(1));
        console.log("RETURN TEXT MATCHED ON THE SECOND GROUP :" + ocrText.getMatchGroup(2));
        console.log("RETURN START INDEX OF THE SECOND GROUP :" + ocrText.getMatchGroupStart(2));
        console.log("RETURN END INDEX OF THE SECOND GROUP :" + ocrText.getMatchGroupEnd(2));
    }
</script>
RETURN TEXT MATCHED BY THE REGULAR EXPRESSION :boring typing this stuff
NUMBER OF GROUP MATCHED :2
RETURN TEXT MATCHED ON THE FIRST GROUP :boring
RETURN START INDEX OF THE FIRST GROUP :230
RETURN END INDEX OF THE FIRST GROUP :236
RETURN TEXT MATCHED ON THE SECOND GROUP :this stuff
RETURN START INDEX OF THE SECOND GROUP :244
RETURN END INDEX OF THE SECOND GROUP :254
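The group offsets above can be reproduced in plain JavaScript to see how group text, start and end relate. This is a sketch only: groupSpans is a hypothetical helper, not an Ax.ocr function, and it relies on the `d` (hasIndices) regular expression flag from ES2022, so it needs a recent runtime such as Node.js 16 or later. The sample string is the phrase searched above on its own, so its indices are relative to the start of that phrase rather than to the PDF page.

```javascript
// Hypothetical helper (not an Ax.ocr function): returns text, start and end
// for the whole match (group 0) and for every capture group.
function groupSpans(text, pattern) {
    // The "d" flag asks the engine to record start/end indices per group.
    const match = new RegExp(pattern, "d").exec(text);
    if (match === null) {
        return [];
    }
    return match.indices.map(function (span, group) {
        // A group that did not take part in the match has no span.
        if (span === undefined) {
            return { group: group, text: undefined, start: -1, end: -1 };
        }
        return { group: group, text: match[group], start: span[0], end: span[1] };
    });
}

const spans = groupSpans("how boring typing this stuff", "(boring).*(this[ ]*stuff)");
spans.forEach(function (s) {
    console.log("GROUP " + s.group + " :" + s.text + " [" + s.start + ", " + s.end + ")");
});
```

As in the sample output, the end index is the position just after the last matched character, so end minus start gives the length of the group text.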
5 Get text by position
The OCRDocumentText class provides the following functions to extract text by position:
- The getNumberOfRows method returns the number of rows in the OCRDocumentText.
- The getLineAt method returns a specific row.
- The getTextRect method returns the text inside a rectangle delimited by a character offset and a size. Params (row offset, number of rows, col offset, number of cols).
The findText function must return true before any of the functions explained above can be used.
<script>
    var pdf = new Ax.pdf.Reader(new Ax.net.URL('https://bitbucket.org/deister/axional-docs-resources/raw/master/PDF/sample.pdf'));
    var ocrText = new Ax.ocr.OCRDocumentText(pdf.getTextFromPage(2));
    console.log("NUMBER OF ROWS :" + ocrText.getNumberOfRows());
    console.log("LINE 2 :" + ocrText.getLineAt(4));
    console.log("GET TEXT RECT :" + ocrText.getTextRect(4,2,20,40));
</script>
NUMBER OF ROWS :9
LINE 2 : ...continued from page 1. Yet more text. And more text. And more text.
GET TEXT RECT :tinued from page 1. Yet more text. And m
re text. And more text. And more text. A
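In the output above, getTextRect(4,2,20,40) returns two rows of 40 characters each. The rectangle selection can be sketched in plain JavaScript as follows. textRect is a hypothetical helper, not the Ax.ocr implementation, and it assumes 0-based row and column offsets, which this page does not confirm.

```javascript
// Hypothetical helper (not an Ax.ocr function): cuts a rectangle of
// rowCount rows by colCount columns out of multi-line text.
function textRect(text, row, rowCount, col, colCount) {
    return text.split("\n")
        .slice(row, row + rowCount)                        // keep the requested rows
        .map(function (line) {
            return line.slice(col, col + colCount);        // keep the requested columns
        })
        .join("\n");
}

// Invented sample page for illustration only.
const page = ["0123456789", "abcdefghij", "ABCDEFGHIJ", "klmnopqrst"].join("\n");

// Two rows starting at row 1, four columns starting at column 3.
const rect = textRect(page, 1, 2, 3, 4);
console.log(rect);
```

Rows shorter than the requested column range simply yield fewer characters in this sketch; how getTextRect treats such rows is not documented here.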
6 Creating a regular expression automatically
The OCRDocumentText class provides the generateMultigroupExpr function, which creates a regular expression from just the words you want to use in the expression.
<script>
    console.log("REGULAR EXPRESSION OF DATE:" + Ax.ocr.OCRDocumentText.generateMultigroupExpr(new Array("Date", "01/03/2019")));
    console.log("REGULAR EXPRESSION OF NUMBER:" + Ax.ocr.OCRDocumentText.generateMultigroupExpr(new Array("Invoice Number:", "321321321")));
    console.log("REGULAR EXPRESSION OF STRING:" + Ax.ocr.OCRDocumentText.generateMultigroupExpr(new Array("Client Name:", "CLIENT NAME")));
</script>
REGULAR EXPRESSION OF DATE:(?:\s*)?(Date)(?:\s*)?(\d+\/\d+\/\d+)?
REGULAR EXPRESSION OF NUMBER:(?:\s*)?(Invoice(?:\s*)?Number\:)(?:\s*)?(\d+)?
REGULAR EXPRESSION OF STRING:(?:\s*)?(Client(?:\s*)?Name\:)(?:\s*)?(\w+(?:\s*)?\w+)?
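Each generated expression has one group for the label and one optional group for the value. The sketch below applies two of the expressions printed above using plain JavaScript RegExp literals, to show what each group captures. extractValue and the sample lines are hypothetical, and the sketch assumes the engine behind findText reads these patterns the same way a JavaScript RegExp does, which this page does not state.

```javascript
// Patterns copied from the output above, written as RegExp literals.
const invoiceExpr = /(?:\s*)?(Invoice(?:\s*)?Number\:)(?:\s*)?(\d+)?/;
const dateExpr = /(?:\s*)?(Date)(?:\s*)?(\d+\/\d+\/\d+)?/;

// Hypothetical helper: group 1 is the label, group 2 the value.
function extractValue(expr, line) {
    const match = expr.exec(line);
    if (match === null) {
        return null;
    }
    return { label: match[1], value: match[2] };
}

// Invented sample lines for illustration only.
const invoice = extractValue(invoiceExpr, "Invoice Number:   321321321");
const date = extractValue(dateExpr, "Date 15/11/2020");

// The value group is optional, so a label without a value still matches.
const empty = extractValue(invoiceExpr, "Invoice Number:");

console.log(invoice);
console.log(date);
console.log(empty);
```

Because the value group is optional, it is worth checking that the second group actually holds text before using it.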