User:Tash/Prototyping 02: Difference between revisions
Revision as of 20:39, 21 January 2018
OCR with Tesseract
To OCR a scanned file in English and create a text file output:
tesseract <inputfilename> <outputfilename>
To OCR a PNG file in English and write the result as hOCR (HTML with word positions) instead of plain text:
tesseract <inputfilename> <outputfilename> hocr
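The two commands above can be wrapped in a loop to process a whole folder of scans. This is only a minimal sketch: the folder name `scans` and the helper names `output_base` and `ocr_folder` are hypothetical, and the loop only runs if Tesseract is installed and the folder exists.

```shell
#!/usr/bin/env bash
# Sketch: OCR every PNG in a folder, writing <name>.txt and <name>.hocr
# next to each image. The folder name "scans" is a hypothetical example.

# Strip the file extension so Tesseract can append .txt / .hocr itself.
output_base() {
  printf '%s\n' "${1%.*}"
}

ocr_folder() {
  local dir="$1" img base
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue        # no PNG files: skip the unexpanded glob
    base="$(output_base "$img")"
    tesseract "$img" "$base"         # plain text -> $base.txt
    tesseract "$img" "$base" hocr    # hOCR       -> $base.hocr
  done
}

# Only run when Tesseract is installed and the folder exists.
if command -v tesseract >/dev/null 2>&1 && [ -d scans ]; then
  ocr_folder scans
fi
```

Tesseract adds the output extension on its own, which is why the helper passes the image path without `.png` as the output name.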
Independent Research: Retraining Tesseract
Tesseract can be trained to detect new fonts. You can also retrain or alter existing training data, on a character / font level or even on a word / dictionary level.
To train Tesseract, install version 3 or higher, and make sure to install it with the training tools:
brew install --with-training-tools tesseract
OPTION 1: MAKE NEW TRAINING DATA
1. Download a box editor program to make a box file and an image file that tell Tesseract how to recognize symbols and characters, e.g. moshpytt.py.
Problem with Moshpytt on Mac: there are a few dependencies you have to install, like pyGTK. I installed it through both brew and pip, but running moshpytt.py still returned the error: No module found. The internet suggests this is because I'm using my Mac’s system Python, which can’t find the pygtk module installed by brew. I'd have to somehow fix this by changing a PATH setting, but I haven't managed to figure this out yet.
2. Use the box editor to set how you want Tesseract to (mis)interpret characters. Save the file and then run the automatic training script, which will do the following:
The order of execution is:
* Work out the language and a list of fonts present
* Generate .tr files from each boxfile
* Concatenate all .tr and .box files for each font into single files
* Run unicharset_extractor on the boxfiles
* Run mftraining and cntraining
* Rename the output files to include the language prefix
* Run combine_tessdata on all the generated files
* Move the lang.traineddata file to the tesseract / tessdata directory.
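Done by hand for a single font, the steps above correspond roughly to the command sequence below. This is a sketch, not the training script itself: it only prints the commands. The `<lang>.<font>.exp0` file naming and the `font_properties` file follow the Tesseract 3 training documentation, while the function name `training_plan`, the font name `myfont` and the tessdata path are assumptions for illustration. It covers one page per font, so the concatenation step is left out.

```shell
#!/usr/bin/env bash
# Sketch of the Tesseract 3 training sequence for one font.
# Prints the commands instead of running them. To execute, pipe the output
# to `sh` from a directory that holds the .tif/.box pair and a
# font_properties file.

training_plan() {
  local lang="$1" font="$2" tessdata="$3"
  local base="$lang.$font.exp0"
  cat <<EOF
tesseract $base.tif $base box.train
unicharset_extractor $base.box
mftraining -F font_properties -U unicharset -O $lang.unicharset $base.tr
cntraining $base.tr
mv inttemp $lang.inttemp
mv pffmtable $lang.pffmtable
mv normproto $lang.normproto
mv shapetable $lang.shapetable
combine_tessdata $lang.
mv $lang.traineddata $tessdata/
EOF
}

# Hypothetical example values: language "eng", font "myfont".
training_plan eng myfont /usr/local/share/tessdata
```

Printing the plan first makes it easy to check the file names before anything in the tessdata directory is touched.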
OPTION 2: ALTER EXISTING TRAINING DATA
Knowing the above, you can also go in via the backdoor. Instead of making new files, extract the existing training data and change the .tr and .box files of each font.
1. Find the 'tessdata' directory; that’s where the training files are:
cd / && sudo find . -iname tessdata
cd ./usr/local/Cellar/tesseract/3.05.01/share/tessdata
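Searching from `/` with sudo works but is slow. A narrower search is sketched below, assuming the data sits under one of a few common install roots; the list of roots and the function name `find_tessdata` are guesses, not something taken from the Tesseract documentation.

```shell
#!/usr/bin/env bash
# Sketch: look for tessdata directories under a few likely roots instead of
# scanning the whole disk with sudo. The roots listed below are guesses.

find_tessdata() {
  local root
  for root in "$@"; do
    [ -d "$root" ] || continue
    # Unreadable subdirectories are not treated as fatal.
    find "$root" -type d -iname tessdata 2>/dev/null || true
  done
}

# Tesseract reads this environment variable to locate its data, so show it if set.
if [ -n "${TESSDATA_PREFIX:-}" ]; then
  echo "TESSDATA_PREFIX=$TESSDATA_PREFIX"
fi

find_tessdata /usr/local/Cellar /usr/local/share /usr/share
```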
2. In /tessdata you'll find combined packages of training files for the languages you have downloaded, e.g. eng.traineddata.
To extract and unpack the components:
combine_tessdata -u eng.traineddata <prefix>
This will create the separate files in the same directory, including a unicharset file (the list of characters the model can recognize) and a series of 'DAWG' files: Directed Acyclic Word Graphs. These are dictionary files that Tesseract uses during OCR to help decide whether a string of characters it has identified as a word is correct. This might be interesting to hack: if Tesseract’s confidence in some characters of a word is low enough, and changing those characters turns the “word” into something that exists in the dictionary, Tesseract will make the correction. You can turn any word list into a DAWG file using Tesseract’s wordlist2dawg utility. The word list files must be .txt files with one word per line.
To turn DAWG files back into .txt files:
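A minimal sketch of building a custom dictionary with wordlist2dawg follows. The file name `custom_words.txt` and the three example words are hypothetical, and `eng.unicharset` is assumed to be one of the unpacked component files in the current directory.

```shell
#!/usr/bin/env bash
# Sketch: build a custom word-dawg from a plain word list.
# custom_words.txt and its three words are hypothetical examples;
# eng.unicharset is assumed to be an unpacked component file.

# One word per line, as wordlist2dawg expects.
printf '%s\n' tesseract glitch archive > custom_words.txt

if command -v wordlist2dawg >/dev/null 2>&1 && [ -f eng.unicharset ]; then
  # Arguments: word list, output dawg file, unicharset.
  wordlist2dawg custom_words.txt eng.word-dawg eng.unicharset
else
  echo "wordlist2dawg or eng.unicharset not found; only the word list was written" >&2
fi
```

Writing the result to `eng.word-dawg` replaces the unpacked dictionary in place, so it may be worth working on a copy of the extracted files.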
dawg2wordlist eng.unicharset eng.word-dawg wordlistfile.txt
There are several different types of DAWG files and each is optional, so you can replace only the ones you want. These two are most common:
<lang>.word-dawg: A dawg made from dictionary words from the language.
<lang>.freq-dawg: A dawg made from the most frequent words which would have gone into word-dawg.
3. Run unicharset_extractor on the box files; the resulting character set file should be called eng.unicharset. Edit and resave the box files using the box editor.
4. Once you have your new files, pack them back into a new <lang>.traineddata file.
To use the new training data, pass the name of the .traineddata file (without the extension) to -l:
tesseract -l <lang>.<nameoftraineddata> <inputfilename> <outputfilename>
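One way to do the repacking in step 4 is to copy the original package and overwrite a single component with `combine_tessdata -o`. The sketch below only prints the commands. The name `hack`, the tessdata path, the file names `input.png` and `output`, and the function name `repack_plan` are all hypothetical, and it assumes the edited `eng.word-dawg` sits in the current directory.

```shell
#!/usr/bin/env bash
# Sketch: swap an edited component into a copy of eng.traineddata.
# Prints the commands instead of running them; the name "hack" and the
# paths are hypothetical examples.

repack_plan() {
  local tessdata="$1" name="$2"
  cat <<EOF
cp $tessdata/eng.traineddata $tessdata/$name.traineddata
combine_tessdata -o $tessdata/$name.traineddata eng.word-dawg
tesseract input.png output -l $name
EOF
}

repack_plan /usr/local/share/tessdata hack
```

Working on a copy keeps the stock `eng.traineddata` intact, so the altered and the original recognition can be compared on the same image.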