User:Tash/Prototyping 02

From XPUB & Lens-Based wiki
Revision as of 19:44, 21 January 2018 by Tash

OCR with Tesseract

To OCR a scanned image in English (the default language) and write the result to a text file (Tesseract appends .txt to the output name itself):

tesseract <inputfilename> <outputfilename>
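
As a concrete illustration of the command above, here is a minimal sketch that derives the output name from the input file. The function names `output_base` and `ocr_to_text` and the file name `page01.png` are made up for this example; `-l eng` just makes the default language explicit.

```shell
#!/bin/sh
# Hypothetical helper: strip the last extension to get the output base name.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical wrapper: OCR one image into <base>.txt.
# Tesseract adds the .txt extension, so only the base name is passed.
ocr_to_text() {
    input="$1"
    base=$(output_base "$input")
    tesseract "$input" "$base" -l eng
}

# Example call (assumes page01.png exists and tesseract is installed):
# ocr_to_text page01.png    # would write page01.txt
```

Passing the base name without an extension avoids ending up with files like page01.txt.txt.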

To OCR a PNG file in English and write hOCR output (HTML that includes the position of each recognized word) instead of plain text:

tesseract <inputfilename> <outputfilename> hocr
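
To check what ended up in the hOCR file, a rough sketch like the one below counts the recognized words. It assumes each word is wrapped in a span with class `ocrx_word`, which is how Tesseract's hOCR output marks words; the exact attribute quoting may differ between versions, so the pattern accepts both quote styles. The function name `count_hocr_words` and the file name `output.hocr` are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: count word-level elements in an hOCR file.
# Assumption: words are marked with class='ocrx_word' (single or double quotes).
count_hocr_words() {
    grep -o "class=['\"]ocrx_word['\"]" "$1" | wc -l | tr -d ' '
}

# Example call (assumes output.hocr was produced by the command above):
# count_hocr_words output.hocr
```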


Independent Research: Retraining Tesseract

Tesseract can be trained to recognize new fonts. You can also retrain or alter existing training data, either at the character/font level or at the vocabulary/dictionary level.

1. To train Tesseract, install version 3 or higher, and make sure the training tools are included. With Homebrew:

brew install --with-training-tools tesseract

2. Find the 'tessdata' directory, which is where the training data files are stored:

cd / && sudo find . -iname tessdata
cd ./usr/local/Cellar/tesseract/3.05.01/share/tessdata
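
The two steps above can also be checked from a script. This is only a sketch under a few assumptions: Tesseract was installed with Homebrew (so `brew --prefix tesseract` can locate it without searching the whole disk), and `combine_tessdata` and `text2image` are among the training tools that were installed. The function names are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given commands are on the PATH.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
        fi
    done
}

# Hypothetical helper: print the tessdata path of a Homebrew install.
# Assumption: brew is available and tesseract was installed through it.
tessdata_dir() {
    printf '%s/share/tessdata\n' "$(brew --prefix tesseract)"
}

# Hypothetical helper: list the language codes of the *.traineddata
# files in a directory (eng.traineddata is listed as eng).
list_traineddata() {
    for f in "$1"/*.traineddata; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern if nothing matches
        basename "$f" .traineddata
    done
}

# Example calls:
# check_tools tesseract combine_tessdata text2image
# list_traineddata "$(tessdata_dir)"
```

Using `brew --prefix` avoids hard-coding the version number in the Cellar path, which changes with every upgrade.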