The package hocr provides functions that convert documents in HOCR format to text documents.

The HOCR format is an HTML representation of the text on a page, with each piece of text positioned by absolute coordinates. It is usually produced as the result of OCR analysis of images and includes the physical coordinates at which each piece of text was found.

These functions analyze documents in HOCR format and produce a text document in which the text is positioned correctly, by transforming the HOCR physical coordinates into row-column coordinates in the resulting text document.
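The coordinate transformation can be illustrated with a minimal Python sketch. It is not this package's API: the names WordCollector, hocr_to_text, char_width and line_height are hypothetical, and the fixed cell size (10 by 20 pixels) is an assumption chosen for the example. The sketch relies only on the HOCR convention that a word is an element with class ocrx_word whose title attribute holds "bbox x0 y0 x1 y1".

```python
import re
from html.parser import HTMLParser

# HOCR stores the bounding box in the title attribute: "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collect (x0, y0, text) for every element with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        match = BBOX_RE.search(attrs.get("title") or "")
        if attrs.get("class") == "ocrx_word" and match:
            self._bbox = tuple(int(v) for v in match.groups())

    def handle_data(self, data):
        # Attach the first non-blank text after a word tag to its bbox.
        if self._bbox and data.strip():
            x0, y0, _, _ = self._bbox
            self.words.append((x0, y0, data.strip()))
            self._bbox = None


def hocr_to_text(hocr, char_width=10, line_height=20):
    """Map pixel coordinates to a row-column grid (cell size is assumed)."""
    collector = WordCollector()
    collector.feed(hocr)

    grid = {}
    for x0, y0, text in collector.words:
        row, col = y0 // line_height, x0 // char_width
        grid.setdefault(row, []).append((col, text))

    lines = []
    for row in range(max(grid, default=-1) + 1):
        line = ""
        for col, text in sorted(grid.get(row, [])):
            # Pad to the target column; keep one space if words would collide.
            pad = max(col, len(line) + 1) if line else col
            line = line.ljust(pad) + text
        lines.append(line)
    return "\n".join(lines)


sample = (
    "<span class='ocrx_word' title='bbox 0 0 50 18'>Name</span> "
    "<span class='ocrx_word' title='bbox 200 0 250 18'>Total</span> "
    "<span class='ocrx_word' title='bbox 0 40 30 58'>Ann</span>"
)
print(hocr_to_text(sample))
```

In this sketch each word is placed at row y0 // line_height and column x0 // char_width, so vertical gaps in the page become blank lines and horizontal gaps become runs of spaces. A real converter would more likely derive the cell size from the page itself, for example from the typical word height and width.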

 

1 Hocr