Install Tesseract 4.0-Ubuntu

From Media Design: Networked & Lens-Based wiki

1. Installing tesseract 4.0 with training.

Make a folder called Tesseract4 where you will install everything, and change into it. Follow the instructions at https://github.com/tesseract-ocr/tesseract/wiki/Compiling#linux until you reach the leptonica step.
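As a minimal sketch, the folder setup can be done from a terminal like this. It assumes you want the folder under your current directory; the location is your choice.

```shell
# Create the working folder (name taken from the text above) and enter it.
# -p avoids an error if the folder already exists.
mkdir -p Tesseract4
cd Tesseract4
pwd
```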

For leptonica:

In the folder Tesseract4, git clone the following:

git clone https://github.com/danbloomberg/leptonica

Then go to the leptonica documentation for your OS and follow step 2, "Using autoconf". In the leptonica directory, run:

./autobuild and then ./configure

The ./configure step builds the Makefiles in that directory and in src.
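The leptonica step could be wrapped in a small helper like the one below. This is a sketch, not the documented procedure: the function name build_leptonica is made up for illustration, the fallback to ./autogen.sh is an assumption for checkouts that do not ship ./autobuild, and the make / sudo make install lines are the usual autoconf sequence rather than something stated on this page.

```shell
# Hypothetical helper: configure and build leptonica from a git checkout.
# Argument 1 is the checkout folder (defaults to ./leptonica).
build_leptonica() {
    src_dir="${1:-leptonica}"
    (
        # Work in a subshell so the caller's directory is left unchanged.
        cd "$src_dir" || exit 1
        # Generate the configure script; which script exists depends on the
        # checkout (assumption: newer ones use autogen.sh).
        if [ -x ./autobuild ]; then
            ./autobuild || exit 1
        elif [ -x ./autogen.sh ]; then
            ./autogen.sh || exit 1
        fi
        # Usual autoconf sequence (assumed, see the leptonica docs).
        ./configure && make && sudo make install
    )
}
```

From inside the Tesseract4 folder you would call it as build_leptonica leptonica. It returns a non-zero status if the folder is missing or any step fails.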


After this step, continue with the remaining instructions, using sudo where they call for it and making sure you are in the Tesseract4 folder. Once they are done, the tesseract executable should be located in /usr/local/bin.


2. Building the training tools:

Follow https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#building-the-training-tools (only make training, sudo make...)

The training tools are installed in /usr/local/bin/ (check that the files are there).
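One way to do that check is a short loop over a few tool names. This is a sketch: the function name check_training_tools is made up, and the four tools listed are only a sample of the Tesseract training programs, so adjust the list to what your build installs.

```shell
# Hypothetical helper: report which training tools exist in a directory.
# Argument 1 is the directory to inspect (defaults to /usr/local/bin).
check_training_tools() {
    bin_dir="${1:-/usr/local/bin}"
    missing=0
    # Sample of training tools; not an exhaustive list.
    for tool in text2image lstmtraining combine_tessdata unicharset_extractor; do
        if [ -x "$bin_dir/$tool" ]; then
            echo "found:   $bin_dir/$tool"
        else
            echo "missing: $bin_dir/$tool"
            missing=1
        fi
    done
    # Non-zero status if at least one tool was not found.
    return "$missing"
}

check_training_tools /usr/local/bin || echo "Some training tools are missing; re-run the training build."
```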

3. Training Tesseract 4.00

Create a folder for the training, as described at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#before-you-start

Download the language files you will use and place them in the tessdata folder.

To build the basis for your training data (i.e. the files you need to have) using tesstrain.sh, run the following command (you can change the font):

training/tesstrain.sh --fonts_dir /usr/share/fonts --fontlist "Ubuntu Mono" --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata --tessdata_dir ./tessdata --output_dir ./testoutput/
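After the command finishes, it can help to look at what landed in the output folder. The sketch below assumes that a --linedata_only run writes .lstmf files plus a list file named eng.training_files.txt into the output folder; the function name summarize_training_output is made up for illustration.

```shell
# Hypothetical helper: summarize what tesstrain.sh left in an output folder.
# Argument 1 is the output folder, argument 2 the language code.
summarize_training_output() {
    out_dir="${1:-./testoutput}"
    lang="${2:-eng}"
    # Count line-data files; a missing folder simply counts as 0.
    count=$(find "$out_dir" -name '*.lstmf' 2>/dev/null | wc -l | tr -d ' ')
    echo "lstmf files: $count"
    # Assumed name of the list file that later training steps read.
    if [ -f "$out_dir/$lang.training_files.txt" ]; then
        echo "list file: $out_dir/$lang.training_files.txt"
    else
        echo "list file: not found"
    fi
}

summarize_training_output ./testoutput eng
```

If the count is 0, a likely cause is that the font named in --fontlist is not installed; fc-list | grep "Ubuntu Mono" shows whether fontconfig can see it.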