Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

The aim of this project is to improve the open source implementation of the hOCR engine from 2005, which is based on tesseract-ocr.

hOCR Problems:

  • The hOCR project appears to have been abandoned: its website has not been updated since July 2008.
  • The author does not answer emails.
  • The main hOCR problem was memory leaks. The latest stable version seems to suffer from major memory leaks under Windows.
  • The library was unable to process any image.

zOCR Fixes:

  • We managed to solve the memory leak problem, but other issues remain unsolved. Mixing hOCR and hSpell under Windows crashes the program. We could not find an appropriate solution to that problem other than replacing hSpell with another speller, aSpell.
  • Create a new source from a scanner.
  • Handwriting issue: although we fixed the main libhocr issue, we could not get it to work with the neural network class we wrote, so the handwriting recognition bug still needs to be fixed. The problem seems to be in the libhocr graphics process: when we paused the process and tried to transfer the result to the neural network class, the program would crash. We could not fix this bug in time. The proposed solution is to create a buffer between libhocr and the neural network class.
  • Use aSpell instead of hSpell.
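
The buffer proposed for the handwriting issue was never implemented, so the following is only a minimal sketch of the idea, written in Python for brevity. The names GlyphSnapshot and GlyphBuffer are hypothetical and belong to neither libhocr nor zOCR. The sketch assumes the crash comes from the recognizer reading memory that the OCR library still owns, which the source does not confirm. Each result is copied into an immutable snapshot before it is queued, so the consumer only ever works on its own copy.

```python
import queue
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GlyphSnapshot:
    """Immutable copy of one segmented glyph (hypothetical structure)."""
    width: int
    height: int
    pixels: bytes  # row-major, one byte per pixel


class GlyphBuffer:
    """Bounded FIFO that decouples the OCR stage from the recognizer."""

    def __init__(self, capacity: int = 64) -> None:
        # A bounded queue makes the producer block instead of growing
        # memory without limit when the recognizer falls behind.
        self._queue: "queue.Queue[GlyphSnapshot]" = queue.Queue(maxsize=capacity)

    def put(self, width: int, height: int, pixels) -> None:
        # bytes() copies the data, so later changes to the producer's
        # memory cannot affect what the consumer sees.
        data = bytes(pixels)
        if len(data) != width * height:
            raise ValueError("pixel data does not match the stated dimensions")
        self._queue.put(GlyphSnapshot(width, height, data))

    def get(self, timeout: Optional[float] = None) -> GlyphSnapshot:
        # Blocks until a snapshot is available or the timeout expires.
        return self._queue.get(timeout=timeout)

    def __len__(self) -> int:
        return self._queue.qsize()


if __name__ == "__main__":
    buffer = GlyphBuffer(capacity=2)
    source = bytearray([0, 255, 255, 0])  # stands in for library-owned memory
    buffer.put(2, 2, source)
    source[0] = 99  # the producer reuses its memory after resuming
    snapshot = buffer.get(timeout=1)
    print(list(snapshot.pixels))
```

In a native build the same idea would amount to copying the bitmap out of the library's structures before the OCR process resumes. Whether that alone removes the crash is untested.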


