User:Tash/Prototyping 02
OCR with Tesseract
To OCR a scanned file in English and create a text file output (Tesseract appends .txt to the output name itself):
tesseract <inputfilename> <outputfilename>
To OCR a PNG file in English and write the result as hOCR (HTML with word positions) instead of plain text:
tesseract <inputfilename> <outputfilename> hocr
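Both commands above can be wrapped in a small helper so that the output name is derived from the input name. This is only a minimal sketch, assuming bash and a tesseract binary on the PATH. The function names ocr_base and ocr_file are hypothetical, and -l eng just spells out the default English language. The .hocr extension is what Tesseract 3.05 writes; other versions may differ.

```shell
#!/usr/bin/env bash

# Hypothetical helper: drop the last extension from a path, because
# tesseract appends .txt or .hocr to the output base name itself.
ocr_base() {
  local dir name
  dir="$(dirname -- "$1")"
  name="$(basename -- "$1")"
  printf '%s/%s\n' "$dir" "${name%.*}"
}

# Hypothetical helper: OCR one image to plain text and to hOCR.
ocr_file() {
  local input="$1"
  local base
  base="$(ocr_base "$input")"
  if ! command -v tesseract >/dev/null 2>&1; then
    echo "tesseract not found on PATH" >&2
    return 1
  fi
  tesseract "$input" "$base" -l eng         # writes "$base.txt"
  tesseract "$input" "$base" -l eng hocr    # writes "$base.hocr" (assumed for 3.05)
}
```

Usage would look like: ocr_file scans/page01.png, which should leave scans/page01.txt and scans/page01.hocr next to the image.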
Independent Research: Retraining Tesseract
Tesseract can be trained to recognize new fonts. You can also retrain or alter existing training data, on a character / font level or even on a vocabulary / dictionary level.
1. To train Tesseract, you need version 3 or higher. Make sure to install it with the training tools:
brew install --with-training-tools tesseract
2. Find the 'tessdata' directory, which is where the training files are:
cd / && sudo find . -iname tessdata
cd ./usr/local/Cellar/tesseract/3.05.01/share/tessdata
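Searching the whole disk from / with sudo can take a while. A narrower search could look like the sketch below, assuming bash. The function name find_tessdata and the default search roots are assumptions for illustration, not anything Tesseract defines, so adjust them to wherever Tesseract is installed.

```shell
#!/usr/bin/env bash

# Hypothetical helper: list directories named "tessdata" (any case)
# under the given roots. The default roots are assumed common prefixes.
find_tessdata() {
  local roots=("$@")
  if [ "${#roots[@]}" -eq 0 ]; then
    roots=(/usr/local /opt/homebrew /usr/share)
  fi
  local root
  for root in "${roots[@]}"; do
    [ -d "$root" ] || continue   # skip prefixes that do not exist here
    find "$root" -type d -iname tessdata 2>/dev/null
  done
}
```

Calling find_tessdata with no arguments searches the assumed defaults; find_tessdata /usr/local/Cellar limits the search to the Homebrew Cellar used above.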