User:Tash/Prototyping 02

OCR with Tesseract

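To OCR a scanned image in English (the default language) and write the result to a text file. Tesseract treats the output name as a base name and adds the .txt extension itself: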

tesseract <inputfilename> <outputfilename>

To OCR a PNG file in English and write the result as hOCR (HTML that also records the position of each recognized word) instead of plain text:

tesseract <inputfilename> <outputfilename> hocr
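As a concrete sketch of the two commands above, assuming a hypothetical input file called scan.png in the current directory. The file name and the explicit `-l eng` are only for illustration. As far as I know, recent 3.x releases name the hOCR output `<base>.hocr`, while older ones used `<base>.html`, so check which one appears.

```shell
#!/bin/sh
# Hypothetical file name, used only for illustration.
input="scan.png"
# Output base name: the input name without its extension ("scan").
outbase="${input%.*}"

if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
  # Plain text: Tesseract appends ".txt" to the base name.
  tesseract "$input" "$outbase" -l eng
  # hOCR: "hocr" is a config name that switches the output format.
  tesseract "$input" "$outbase" -l eng hocr
else
  echo "tesseract or $input not found; nothing to do" >&2
fi
```

Deriving the base name from the input keeps the text and hOCR files next to the scan under the same name.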

Independent Research: Retraining Tesseract

Tesseract can be trained to recognize new fonts. You can also retrain or alter existing training data, either at the character / font level or at the vocabulary / dictionary level.

1. To train Tesseract, install version 3 or higher, and make sure the training tools are installed with it:

brew install --with-training-tools tesseract

2. Find the 'tessdata' directory, which is where the training files are stored:

cd / && sudo find . -iname tessdata
cd ./usr/local/Cellar/tesseract/3.05.01/share/tessdata
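The two steps above can be checked with a small script. This is a minimal sketch, not part of the original notes. The function names `missing_tools` and `find_tessdata` are made up for illustration, and the list of training executables is my assumption about what a Tesseract 3.x install with training tools provides. Searching only below the Homebrew prefix avoids a `sudo find` over the whole disk.

```shell
#!/bin/sh
# Print each named command that is NOT available on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Print every directory called "tessdata" (case-insensitive) below a root.
find_tessdata() {
  find "$1" -type d -iname tessdata 2>/dev/null
}

# Step 1 check: tool names assumed from the Tesseract 3.x training tools.
# Anything printed here is missing.
missing_tools text2image unicharset_extractor mftraining cntraining combine_tessdata

# Step 2: search only the Homebrew prefix, if brew is present (no sudo needed).
if command -v brew >/dev/null 2>&1; then
  find_tessdata "$(brew --prefix)"
fi
```

If the first call prints nothing, all the listed tools were found. The second prints the full path of each tessdata directory, which you can then `cd` into.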