Revision as of 17:38, 13 January 2018

software

install

Tesseract (with languages you will be using)

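Mac: brew install tesseract --all-languages (recent Homebrew versions dropped this option; the extra language data is in the separate tesseract-lang formula)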
Debian/Ubuntu: sudo aptitude install tesseract-ocr
- See what language packages are available with: sudo aptitude search tesseract-ocr-
- Install language packages: sudo aptitude install tesseract-ocr-ara tesseract-ocr-por tesseract-ocr-spa (here installing Arabic, Portuguese and Spanish)

poppler-utils, which includes tools such as pdftotext and pdftohtml

Mac: brew install poppler (the Homebrew formula is named poppler, not poppler-utils)
Debian/Ubuntu: sudo aptitude install poppler-utils

imagemagick

Mac: brew install imagemagick
Debian/Ubuntu: sudo aptitude install imagemagick

pdftk

Mac: brew install pdftk
Debian/Ubuntu: sudo aptitude install pdftk

PDF

with text layer
without text layer

To find out the difference you can try to select the PDF's text in a PDF viewer. Only if the text layer is present will you be able to select it.

If the PDF contains a text layer, you can use the pdftotext command-line application (from poppler-utils) to convert the PDF to text.
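The selection test above can also be approximated from the command line. The following is only a sketch: has_text_layer is a hypothetical helper name, and it assumes that a PDF without a text layer makes pdftotext produce nothing but whitespace.

```shell
#!/bin/sh
# has_text_layer FILE: succeed if pdftotext extracts any non-blank text.
# (Hypothetical helper; pdftotext comes from poppler-utils.)
has_text_layer() {
    # "-" as the output name sends the extracted text to stdout
    text=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
}
```

It could be used as: if has_text_layer yourfile.pdf; then pdftotext -layout yourfile.pdf yourfile.txt; else echo "no text layer, OCR needed"; fi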

Tesseract

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.^[1]

Tesseract is a Free software OCR package.

one page prototype

Get a single page out of the PDF file with pdftk burst, which splits the PDF into one file per page (pg_0001.pdf, pg_0002.pdf, ...):

pdftk yourfile.pdf burst

Choose the page you want to convert.

Convert PDF to bit-map using imagemagick, with some options to optimize OCR

convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff

-density 300: render at a resolution of 300 DPI. Lower resolutions tend to produce more recognition errors.
-depth 8: 8 bits per color channel (on its own this does not convert the image to grey-scale).
-strip -background white -alpha off: strips metadata, removes the alpha channel (opacity), and makes the background white.
output.tiff: earlier versions of Tesseract only accepted TIFF images; current versions accept more bitmap formats.

See the Tesseract page on improving the quality of images for OCR ^[2]

OCR

tesseract output.tiff -l eng output

Will generate the file output.txt

-l is the option for language (English is the default)
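The three steps of the prototype can be chained in one small function. This is a sketch rather than a fixed recipe: ocr_page and the page_N file names are made up for illustration, and it uses pdftk's cat operation to pull out a single page instead of bursting the whole file.

```shell
#!/bin/sh
# ocr_page PDF PAGE [LANG]: OCR one page of a PDF and print the
# name of the resulting text file. (Hypothetical helper.)
ocr_page() {
    pdf=$1; page=$2; lang=${3:-eng}
    base="page_$page"
    # extract only the requested page (instead of bursting every page)
    pdftk "$pdf" cat "$page" output "$base.pdf" || return 1
    # rasterize with the same options as above
    convert -density 300 "$base.pdf" -depth 8 -strip \
        -background white -alpha off "$base.tiff" || return 1
    # Tesseract appends .txt to the output base name
    tesseract "$base.tiff" "$base" -l "$lang" || return 1
    echo "$base.txt"
}
```

For example, ocr_page yourfile.pdf 3 spa would leave the recognized text of page 3 in page_3.txt.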

Advanced

language

Lists all tesseract languages available in your system.

tesseract --list-langs

When OCRing a document that mixes languages, Tesseract can use more than one language at once; join the language codes with +:

tesseract output.tiff -l eng+spa output
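Before running a multi-language job it can help to check that every requested language is actually installed. A minimal sketch, assuming that --list-langs prints one language code per line after a header line; missing_langs is a hypothetical helper name.

```shell
#!/bin/sh
# missing_langs SPEC: print each language code in SPEC (e.g. "eng+spa")
# that "tesseract --list-langs" does not report as installed.
missing_langs() {
    # some releases print the list on stderr, so merge both streams
    installed=$(tesseract --list-langs 2>&1)
    for code in $(printf '%s' "$1" | tr '+' ' '); do
        # -x: whole-line match, -F: fixed string, -q: quiet
        printf '%s\n' "$installed" | grep -qxF "$code" || printf '%s\n' "$code"
    done
}
```

If missing_langs eng+spa prints nothing, both languages are available; otherwise it prints the codes whose language packages still need to be installed.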

multipages

TIFF files can be multi-page images. Hence, if we use the previous ImageMagick command to convert a multi-page PDF to a TIFF, the TIFF will be multi-page too, and Tesseract should handle it page by page.

$ tesseract TypewriterArt.tiff TypewriterArt
Tesseract Open Source OCR Engine v3.03 with Leptonica
Page 1 of 8
Page 2 of 8
Page 3 of 8

Another option is providing Tesseract with a text file containing the path/filename to each image in sequence:

list.txt:

p001.tiff
p002.tiff
p003.png

tesseract list.txt output
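The list file does not have to be written by hand. The sketch below assumes the page images sit in one directory and have zero-padded names (p001, p002, ...) so that sorting by name gives the page order; make_page_list is a hypothetical helper name.

```shell
#!/bin/sh
# make_page_list DIR: print the TIFF and PNG files found directly
# in DIR, one path per line, sorted by name.
make_page_list() {
    find "$1" -maxdepth 1 -type f \
        \( -name '*.tiff' -o -name '*.tif' -o -name '*.png' \) | sort
}
```

Usage: make_page_list pages > list.txt, followed by tesseract list.txt output.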

segmentation

Page Segmentation Mode (-psm) directs the layout analysis that Tesseract performs on the page.

By default, Tesseract automates the page segmentation, but does not perform orientation and script detection.

From Tesseract man page:

       -psm N
           Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:

               0 = Orientation and script detection (OSD) only.
               1 = Automatic page segmentation with OSD.
               2 = Automatic page segmentation, but no OSD, or OCR.
               3 = Fully automatic page segmentation, but no OSD. (Default)
               4 = Assume a single column of text of variable sizes.
               5 = Assume a single uniform block of vertically aligned text.
               6 = Assume a single uniform block of text.
               7 = Treat the image as a single text line.
               8 = Treat the image as a single word.
               9 = Treat the image as a single word in a circle.
               10 = Treat the image as a single character.
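The man page excerpt above is from Tesseract 3.x, where the flag is spelled -psm. To the best of my knowledge, releases from 4.0 onwards spell it --psm instead. The sketch below picks the spelling from the installed version; psm_flag is a hypothetical helper name, and the assumption is that the first line of tesseract --version looks like "tesseract 3.04.01".

```shell
#!/bin/sh
# psm_flag: print the page-segmentation flag for the installed Tesseract
# ("-psm" for 3.x, "--psm" for 4.0 and later; assumed, see above).
psm_flag() {
    # take the major version number from the first line of the output
    major=$(tesseract --version 2>&1 |
        sed -n '1s/^tesseract v\{0,1\}\([0-9][0-9]*\).*/\1/p')
    # fall back to the 3.x spelling if the version could not be parsed
    if [ "${major:-3}" -ge 4 ]; then echo "--psm"; else echo "-psm"; fi
}
```

For an image holding a single line of text, this could be used as: tesseract line.tiff line "$(psm_flag)" 7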

searchable PDF

tesseract input.tiff output -l eng pdf

hocr

Tesseract 3.0x supports an hocr option, which creates an hOCR file.

HOCR is an HTML+XML (XHTML) file consisting of recognized words and their coordinates.

The HOCR file contains all pages as ocr_page elements (class='ocr_page'), each with a title attribute that contains the following fields:

ppageno: The physical page number
image: The relative path (from the HOCR file) to the page image
bbox: The dimensions of the image

The OCRed text is atomized into text elements of different magnitude, such as:

paragraph "ocr_par"
line "ocr_line"
word "ocrx_word"
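Because each word element carries its coordinates, a rough word list with bounding boxes can be pulled out of an hOCR file with standard text tools. This is a sketch, not a real HTML parser: hocr_words is a hypothetical helper name, and it assumes the markup Tesseract typically writes (single-quoted attributes, class first, the bbox at the start of the title attribute). HTML entities such as &amp; are left undecoded.

```shell
#!/bin/sh
# hocr_words FILE: print "x0 y0 x1 y1 word" for every ocrx_word element.
hocr_words() {
    # drop <strong>/<em> wrappers so each word span holds plain text
    sed -e 's/<\/*strong>//g' -e 's/<\/*em>//g' "$1" |
        # put each word span on its own line
        grep -o "<span class='ocrx_word'[^>]*>[^<]*</span>" |
        # keep only the four bbox numbers and the word itself
        sed "s/.*title='bbox \([0-9 ]*[0-9]\)[^>]*>\([^<]*\)<\/span>/\1 \2/"
}
```

Usage: hocr_words output.hocr (the file extension of the hOCR output differs between Tesseract versions).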

HOCR tools

hocrviewer-mirador
Python-based box file editor: moshPyTT for Tesseract v.3.0

^[3]

Training

Extensive documentation on training Tesseract ^[4]
Tutorial: Adding New Fonts to Tesseract 3 OCR Engine ^[5]
Tutorial: A Guide on OCR with tesseract 3.03 ^[6]
Tutorial: How to prepare training files for Tesseract OCR and improve character recognition ^[7]

Tesseract needs to know about different shapes of the same character by having different fonts separated explicitly.

The tessdata/ directory, where the data files live, is at /usr/share/tesseract-ocr/tessdata on Debian. If it is located elsewhere on your system, you can find it with:

sudo find / -type d -name "tessdata"

box output

box output is ...!

When the input is standard text in a standard font, the results are not bad.

But for unusual fonts or hand-written scripts, Tesseract can be trained.

Tesseract needs a 'box' file to go with each training image. The box file is a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around the image. ^[8]

convert -density 300 wafer.pdf -depth 8 -strip -background white -alpha off wafer.tiff

tesseract wafer.tiff wafer makebox
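Before opening the result in an editor, a quick look at the box sizes can reveal badly segmented glyphs. The sketch below assumes the Tesseract 3 box format, one glyph per line as "symbol left bottom right top page", with coordinates measured from the bottom-left corner of the image; box_sizes is a hypothetical helper name.

```shell
#!/bin/sh
# box_sizes FILE: print "symbol width height" for each box-file line.
box_sizes() {
    # fields: symbol left bottom right top page
    awk 'NF == 6 { print $1, $4 - $2, $5 - $3 }' "$1"
}
```

Usage: box_sizes wafer.box. Boxes that are much wider than their neighbours often cover two merged characters and are worth checking in the box editor.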

Edit with moshpytt

./moshpytt.py

box editors

Boxmaker - JS online
moshpytt - python

See more Tesseract add-ons at https://github.com/tesseract-ocr/tesseract/wiki/AddOns

Fons

A very convoluted way to give Tesseract some help in recognizing a font could be to use OSP Fons - a recipe to make fonts out of bitmap images - to create a font out of the glyphs present in the scanned document.

The resulting font could then be given to Tesseract as a language to help recognize the text in that font.

If you want to use Fons you will need to compile some software: Autotracer and Glyphtracer, which come with the OSP repository.

You will also need to install python-fontforge and fontforge, and have good Gimp/Photoshop skills.

But it might be worth the ride.

Artistic research

"Reverse OCR"

Kindle Scanner Peter Purgathofer

References

[1] https://github.com/tesseract-ocr/tesseract/blob/master/README.md
[2] https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
[3] ‘HOCR - OCR Workflow and Output Embedded in HTML’. n.d. Accessed 11 January 2018. http://kba.cloud/hocr-spec/1.2/.
[4] https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
[5] http://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/
[6] https://www.joyofdata.de/blog/a-guide-on-ocr-with-tesseract-3-03/
[7] http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/
[8] https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files


Optical character recognition with Tesseract: Difference between revisions