User:Tash/Prototyping 02: Difference between revisions

Revision as of 20:05, 21 January 2018

OCR with Tesseract

To OCR a scanned file in English and create a text file output:

tesseract <inputfilename> <outputfilename>

To OCR a PNG file in English and write the result as hOCR (HTML with layout information) instead of plain text:

tesseract <inputfilename> <outputfilename> hocr
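A minimal sketch of running both commands above over a folder of scans. The scans/ directory and the helper names are hypothetical. Tesseract appends the .txt or .hocr extension itself, so the output argument is a base name without an extension.

```shell
# Derive the output base name by stripping the input file's extension.
output_base() {
  printf '%s\n' "${1%.*}"
}

# Run both the plain-text and the hOCR pass on one image.
ocr_file() {
  base="$(output_base "$1")"
  tesseract "$1" "$base"        # writes $base.txt
  tesseract "$1" "$base" hocr   # writes $base.hocr
}

# Only attempt OCR if tesseract is installed; scans/ is an assumed folder name.
if command -v tesseract >/dev/null 2>&1; then
  for f in scans/*.png; do
    [ -e "$f" ] || continue     # skip when the glob matches nothing
    ocr_file "$f"
  done
fi
```

Each PNG ends up with a .txt and a .hocr file next to it.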

Independent Research: Retraining Tesseract

Tesseract can be trained to detect new fonts. You can also retrain or alter existing training data, on a character / font level or even on a word / dictionary level.

1. To train Tesseract, download the correct version (3 or higher). Make sure to install it with the training tools:

brew install --with-training-tools tesseract

2. Find the 'tessdata' directory, which is where the training files are:

cd / && sudo find . -iname tessdata
cd ./usr/local/Cellar/tesseract/3.05.01/share/tessdata

3. In /tessdata you'll find combined packages of training files for the languages you have downloaded, e.g. eng.traineddata. To extract and unpack the components:

combine_tessdata -u eng.traineddata <prefix>

This creates the separate component files in the same directory, including a series of 'DAWG' files: Directed Acyclic Word Graphs. These are dictionary files that Tesseract uses during OCR to help decide whether a string of characters it has identified as a word is correct. This might be interesting to hack: if Tesseract's confidence in the characters of a word is low enough, and changing those characters would turn the “word” into something that exists in the dictionary, Tesseract will make the correction. You can turn any word list into a DAWG file using Tesseract’s wordlist2dawg utility. The word list must be a .txt file with one word per line.
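A rough sketch of building a custom word-dawg, assuming the components were unpacked with the prefix "eng." and that you are in the directory that holds them. The file name my-words.txt and the words in it are made up for illustration. The last step overwrites a component inside eng.traineddata, so the sketch makes a backup first.

```shell
# Hypothetical word list: one word per line, sorted, duplicates removed.
wordlist="my-words.txt"
printf '%s\n' "prototyping" "tesseract" "acyclic" "tesseract" \
  | LC_ALL=C sort -u > "$wordlist"

# Assumed name of the unpacked unicharset (prefix "eng.").
unicharset="eng.unicharset"

# Only run the Tesseract tools if they are installed and the unicharset exists.
if command -v wordlist2dawg >/dev/null 2>&1 && [ -f "$unicharset" ]; then
  # Compile the word list into a DAWG file.
  wordlist2dawg "$wordlist" eng.word-dawg "$unicharset"
  # Keep a copy of the original package, then replace its word-dawg component.
  cp eng.traineddata eng.traineddata.bak
  combine_tessdata -o eng.traineddata eng.word-dawg
fi
```

Restoring the backup undoes the change.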

To turn DAWG files back into .txt files:

dawg2wordlist eng.unicharset eng.word-dawg wordlistfile.txt
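Once the word list is extracted, it can be searched like any text file. The sketch below checks whether a few example words are in the dictionary; the helper name in_wordlist and the example words are assumptions, and wordlistfile.txt is the file written by the command above.

```shell
# Succeeds if the exact word appears on its own line in the word list.
in_wordlist() {
  grep -Fxq -- "$1" "$2"
}

# Report on a few example words, if the extracted list is present.
if [ -f wordlistfile.txt ]; then
  for word in tesseract acrylic acyclic; do
    if in_wordlist "$word" wordlistfile.txt; then
      echo "$word: in dictionary"
    else
      echo "$word: not in dictionary"
    fi
  done
fi
```

The -x flag makes grep match whole lines only, so a word is not reported just because it is part of a longer entry.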

There are several different types of DAWG files, and each is optional, so you can replace only the ones you want. These two are the most common:

<lang>.word-dawg: A DAWG made from dictionary words from the language.

<lang>.freq-dawg: A DAWG made from the most frequent words which would have gone into word-dawg.

@@ Line 17: / Line 17: @@
 ==Independent Research: Retraining Tesseract==
-Tesseract can be trained to detect new fonts. You can also retrain or alter existing training data, on a character / font level or even on a vocabulary / dictionary level.
+Tesseract can be trained to detect new fonts. You can also retrain or alter existing training data, on a character / font level or even on a word / dictionary level.
 . To train tesseract, download the correct version of Tesseract (3 or higher). Make sure to install it with training tools:
@@ Line 29: / Line 29: @@
 cd ./usr/local/Cellar/tesseract/3.05.01/share/tessdata
 </source>
+. In /tessdata you'll find combined packages of training files for the languages you have downloaded, e.g. eng.traineddata.
+To extract and unpack the components:
+<source lang="bash">
+combine_tessdata -u eng.traineddata <prefix>
+</source>
+This will create the separate files in the same directory, including a series of 'DAWG' files: Directed Acyclic Word Graphs. These are dictionary files used by Tesseract during the OCR process to help it determine if the string of characters it has identified as a word is correct. This might be interesting to hack, because if the confidence that Tesseract has in the characters in a word is sufficiently low so that changing these characters will cause the “word” to be changed into something that exists in the dictionary, Tesseract will make the correction. You can turn any word list into a DAWG file using Tesseract’s ''wordlist2dawg'' utility. The word list files must be .txt files with one word per line.
+To turn DAWG files back into .txt files
+<source lang="bash">
+dawg2wordlist eng.unicharset eng.word-dawg wordlistfile.txt
+</source>
+There are several different types of DAWG files and each is optional so you can replace only the ones you want. These two are most common:
+<lang>.word-dawg:
+A dawg made from dictionary words from the language.
+<lang>.freq-dawg:
+A dawg made from the most frequent words which would have gone into word-dawg.