Revision as of 11:47, 12 January 2018

install

Tesseract (with languages you will be using)

Mac: brew install tesseract --all-languages
Debian/Ubuntu: sudo aptitude install tesseract-ocr
- See which language packages are available with: sudo aptitude search tesseract-ocr-
- Install language packages with: sudo aptitude install tesseract-ocr-ara tesseract-ocr-por tesseract-ocr-spa (here installing Arabic, Portuguese and Spanish)

poppler-utils, which includes tools such as pdftotext and pdftohtml

Mac: brew install poppler (the Homebrew formula is named poppler; it ships the same command-line tools)
Debian/Ubuntu: sudo aptitude install poppler-utils

imagemagick

Mac: brew install imagemagick
Debian/Ubuntu: sudo aptitude install imagemagick

pdftk

Mac: brew install pdftk
Debian/Ubuntu: sudo aptitude install pdftk

PDF

with text layer
without text layer

To tell the two apart, try selecting the PDF's text in a PDF viewer. You will only be able to select it if a text layer is present.

If the PDF contains a text layer, you can use the pdftotext command-line application (from poppler-utils) to convert it to text.
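
The same check can be scripted. The following is a minimal sketch, not a definitive test: the function name has_text_layer is made up for this example, and it assumes pdftotext is on the PATH. It extracts the text to standard output and reports whether anything other than whitespace came out.

```shell
# has_text_layer FILE.pdf
# Succeeds (exit status 0) if pdftotext extracts any non-whitespace text.
# Hypothetical helper name; assumes pdftotext (poppler-utils) is installed.
has_text_layer() {
    # "-" as the output file makes pdftotext write to standard output.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Usage:
#   if has_text_layer yourfile.pdf; then
#       pdftotext yourfile.pdf yourfile.txt
#   else
#       echo "no text layer, OCR needed"
#   fi
```

A scanned PDF that was already OCRed by someone else will also pass this check, so the quality of the extracted text still needs a look.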

Tesseract

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.^[1]

Tesseract is a free software OCR package.

one page prototype

Get a single page out of the PDF file with pdftk burst:

pdftk yourfile.pdf burst

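Choose the page you want to convert. By default burst writes one file per page, named pg_0001.pdf, pg_0002.pdf and so on.

Convert the PDF page to a bitmap using ImageMagick, with some options to optimize it for OCR: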

convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff

-density 300: render at a resolution of 300 DPI. Lower resolutions tend to produce more recognition errors.
-depth 8: bit depth of the output, 8 bits per channel. On its own this does not convert the image to grey-scale; add -colorspace Gray for that.
-strip -background white -alpha off: removes metadata and the alpha channel (opacity), and makes the background white.
output.tiff: earlier versions of Tesseract only accepted TIFF images, but more bitmap formats are accepted now.

OCR

tesseract output.tiff -l eng output

This generates the file output.txt.

-l is the option for language (English is the default)
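
The three steps above can be chained into one function. This is a sketch under assumptions: the name ocr_page and its arguments are invented for this example, and pdftk, ImageMagick's convert and tesseract are expected to be installed. It uses the output option of burst to place the per-page files in a temporary directory instead of the current one.

```shell
# ocr_page PDF PAGE_NUMBER OUTPUT_BASE
# Hypothetical helper: bursts the PDF, converts one page to a 300 DPI TIFF
# and runs Tesseract on it, leaving OUTPUT_BASE.txt behind.
# Assumes pdftk, ImageMagick's convert and tesseract are installed.
ocr_page() {
    pdf=$1
    page=$2
    out=$3
    workdir=$(mktemp -d) || return 1

    # burst writes one file per page; the output pattern is printf-style.
    pdftk "$pdf" burst output "$workdir/pg_%04d.pdf" || return 1
    page_pdf=$(printf '%s/pg_%04d.pdf' "$workdir" "$page")

    # Same ImageMagick options as in the convert command above.
    convert -density 300 "$page_pdf" -depth 8 -strip \
        -background white -alpha off "$workdir/page.tiff" || return 1

    # English is the default language; -l eng only makes it explicit.
    tesseract "$workdir/page.tiff" "$out" -l eng || return 1

    rm -rf "$workdir"
}

# Usage: ocr_page yourfile.pdf 3 page3    (writes page3.txt)
```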

Advanced

language

List all Tesseract languages available on your system:

tesseract --list-langs

Select more than one language:

tesseract output.tiff -l eng+spa output
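
Before starting a long OCR run it can help to confirm that every requested language is actually installed. A small sketch, with invented function names (has_lang, langs_available); it assumes the language list is printed one code per line, and it merges stderr into stdout because some Tesseract versions print the list on stderr.

```shell
# has_lang CODE
# Succeeds if Tesseract reports the language CODE (e.g. "spa") as installed.
# Hypothetical helper; assumes tesseract is on the PATH.
has_lang() {
    # -x matches whole lines only, which also skips the header line;
    # -F treats the code as a fixed string rather than a pattern.
    tesseract --list-langs 2>&1 | grep -qxF "$1"
}

# langs_available "eng+spa"
# Checks every code of a "+"-joined language string.
langs_available() {
    for code in $(printf '%s' "$1" | tr '+' ' '); do
        has_lang "$code" || { echo "missing language: $code" >&2; return 1; }
    done
}

# Usage:
#   langs_available eng+spa && tesseract output.tiff -l eng+spa output
```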

multipages

TIFF files can be multi-page images. Hence, if we use the previous ImageMagick command to convert a multi-page PDF to a TIFF, the resulting TIFF will be multi-page as well, which Tesseract should handle.

$ tesseract TypewriterArt.tiff TypewriterArt
Tesseract Open Source OCR Engine v3.03 with Leptonica
Page 1 of 8
Page 2 of 8
Page 3 of 8

Another option is providing Tesseract with a text file containing the path/filename to each image in sequence:

list.txt:

p001.tiff
p002.tiff
p003.png

tesseract list.txt output
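
Writing list.txt by hand gets tedious for long documents. A minimal sketch of generating it; the function name make_page_list is hypothetical, and it assumes the page images carry zero-padded names (p001, p002, ...) so that the shell's sorted glob expansion yields them in page order.

```shell
# make_page_list DIR LIST_FILE
# Writes the path of every TIFF or PNG found in DIR to LIST_FILE,
# one per line, in the order the shell expands the glob (sorted by name).
make_page_list() {
    : > "$2"    # start with an empty list
    for f in "$1"/*; do
        case $f in
            *.tiff|*.tif|*.png) printf '%s\n' "$f" >> "$2" ;;
        esac
    done
}

# Usage:
#   make_page_list pages list.txt
#   tesseract list.txt output
```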

segmentation

Page Segmentation Mode (-psm) directs the layout analysis that Tesseract performs on the page.

By default, Tesseract automates the page segmentation, but does not perform orientation and script detection.

From Tesseract man page:

       -psm N
           Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:

               0 = Orientation and script detection (OSD) only.
               1 = Automatic page segmentation with OSD.
               2 = Automatic page segmentation, but no OSD, or OCR.
               3 = Fully automatic page segmentation, but no OSD. (Default)
               4 = Assume a single column of text of variable sizes.
               5 = Assume a single uniform block of vertically aligned text.
               6 = Assume a single uniform block of text.
               7 = Treat the image as a single text line.
               8 = Treat the image as a single word.
               9 = Treat the image as a single word in a circle.
               10 = Treat the image as a single character.
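
As an example of choosing a mode, the sketch below OCRs an image that holds a single line of text (mode 7) and prints the result to the terminal. The function names are invented for this example. It also works from an assumption that goes beyond the man page excerpt above: Tesseract releases from 4.0 on spell the option --psm, while the 3.x releases use -psm, so the flag is picked from the version banner.

```shell
# psm_flag
# Prints the page segmentation flag for the installed Tesseract.
# Assumption: releases from 4.0 on spell it --psm, older ones -psm.
psm_flag() {
    # The banner's first line looks like "tesseract 3.03"; some releases
    # print it on stderr, hence 2>&1.
    major=$(tesseract --version 2>&1 |
        sed -n '1s/^tesseract v\{0,1\}\([0-9][0-9]*\).*/\1/p')
    if [ "${major:-3}" -ge 4 ]; then
        printf '%s\n' --psm
    else
        printf '%s\n' -psm
    fi
}

# ocr_single_line IMAGE
# Treats the image as a single text line (mode 7) and prints the text.
# "stdout" as output base sends the result to standard output.
ocr_single_line() {
    tesseract "$1" stdout "$(psm_flag)" 7
}

# Usage: ocr_single_line caption.png
```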

searchable PDF

tesseract input.tiff output -l eng pdf
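
For a whole scanned PDF, the ImageMagick conversion and the pdf output can be combined. A sketch, assuming convert and tesseract are installed; the function name searchable_pdf and its arguments are hypothetical.

```shell
# searchable_pdf INPUT.pdf OUTPUT_BASE [LANG]
# Hypothetical helper: rasterizes every page of INPUT.pdf into one
# multi-page TIFF, then asks Tesseract for a PDF with a text layer.
# The result is written to OUTPUT_BASE.pdf. LANG defaults to eng.
searchable_pdf() {
    workdir=$(mktemp -d) || return 1
    convert -density 300 "$1" -depth 8 -strip \
        -background white -alpha off "$workdir/pages.tiff" || return 1
    tesseract "$workdir/pages.tiff" "$2" -l "${3:-eng}" pdf
    status=$?
    rm -rf "$workdir"
    return $status
}

# Usage: searchable_pdf scan.pdf scan-ocr spa    (writes scan-ocr.pdf)
```

At 300 DPI the intermediate TIFF can get large for long documents, so processing the pages in batches may be preferable there.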

hocr

Tesseract 3.0x supports an hocr option, which creates an HOCR file.

HOCR is an HTML+XML (XHTML) file consisting of recognized words and their coordinates.

The HOCR file contains every page as an element with class='ocr_page'. Its title attribute contains the following fields:

ppageno: The physical page number
image: The relative path (from the HOCR file) to the page image
bbox: The dimensions of the image

The OCRed text is broken down into text elements of different granularity, such as:

paragraph "ocr_par"
line "ocr_line"
word "ocrx_word"
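
Since every word element carries its coordinates, the words and their bounding boxes can be pulled out with standard text tools. The sketch below is deliberately naive and is not a real HTML parser: the function name hocr_words is hypothetical, and it assumes markup in the style Tesseract writes (single-quoted attributes, bbox as the first field of title, no markup nested inside a word).

```shell
# hocr_words FILE.hocr
# Prints one line per recognized word: "x0 y0 x1 y1 word".
# A sketch only: assumes Tesseract-style markup (single-quoted attributes,
# bbox first in the title attribute, no tags nested inside a word).
hocr_words() {
    # grep -o isolates each ocrx_word span, even if several share a line;
    # sed then keeps the four bbox numbers and the word itself.
    grep -o "<span class='ocrx_word'[^>]*>[^<]*</span>" "$1" |
        sed -E "s/.*title='bbox ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+)[^']*'[^>]*>([^<]*)<\/span>/\1 \2 \3 \4 \5/"
}

# Usage: hocr_words output.hocr | sort -n -k2,2 -k1,1
```

For anything beyond a quick look, a proper XML or HTML parser is the safer choice, since words in bold or italics are wrapped in extra tags that this pattern skips.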

HOCR tools

hocrviewer-mirador

^[2]

box

When the input is standard text in a standard font, the results are not bad.

But for unusual fonts or hand-written scripts, Tesseract can be trained.

Tesseract needs a 'box' file to go with each training image. The box file is a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around the image. ^[3]

tesseract input.tiff output -l nld makebox
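
Each line of the resulting box file has the form "character left bottom right top page", with coordinates counted from the bottom-left corner of the image. Before editing boxes by hand it can be useful to get an overview of the file. A sketch with awk, using invented function names (box_sizes, box_char_counts):

```shell
# box_sizes FILE.box
# Prints "character width height" for every box in the file.
box_sizes() {
    awk '{ print $1, $4 - $2, $5 - $3 }' "$1"
}

# box_char_counts FILE.box
# Prints "count character", most frequent characters first, to see
# which characters have few samples in the training image.
box_char_counts() {
    awk '{ n[$1]++ } END { for (c in n) print n[c], c }' "$1" | sort -rn
}

# Usage:
#   box_sizes output.box
#   box_char_counts output.box
```

Boxes with a width or height far from the rest are usually the ones where Tesseract merged or split characters, and are a good place to start correcting.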

training

Box file editor: moshPyTT for Tesseract v.3.0

See more Tesseract add-ons at https://github.com/tesseract-ocr/tesseract/wiki/AddOns

Artistic research

"Reverse OCR"

Kindle Scanner Peter Purgathofer

References

[1] https://github.com/tesseract-ocr/tesseract/blob/master/README.md
[2] ‘HOCR - OCR Workflow and Output Embedded in HTML’. n.d. Accessed 11 January 2018. http://kba.cloud/hocr-spec/1.2/.
[3] https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files



Optical character recognition with Tesseract