Optical character recognition with Tesseract
Revision by Andre Castro
Revision as of 12:33, 12 January 2018
install
Tesseract (with languages you will be using)
- Mac
brew install tesseract --all-languages
- Debian/Ubuntu:
sudo aptitude install tesseract-ocr
- See what language packages are available with:
sudo aptitude search tesseract-ocr-
- install language packages:
sudo aptitude install tesseract-ocr-ara tesseract-ocr-por tesseract-ocr-spa
Here I am installing Arabic, Portuguese and Spanish.
poppler-utils, which includes tools such as pdftotext and pdftohtml
- Mac
brew install poppler
- Debian/Ubuntu:
sudo aptitude install poppler-utils
imagemagick
- Mac
brew install imagemagick
- Debian/Ubuntu:
sudo aptitude install imagemagick
pdftk
- Mac
brew install pdftk
- Debian/Ubuntu:
sudo aptitude install pdftk
PDF files come in two kinds:
- with text layer
- without text layer
To find out the difference you can try to select the PDF's text in a PDF viewer. Only if the text layer is present will you be able to select it.
If the PDF contains a text layer, you can use the pdftotext command-line application (from poppler-utils) to convert it to text.
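The selection test can also be automated. The following is a minimal sketch, assuming pdftotext (from poppler-utils) is on your PATH. The function name has_text_layer is hypothetical, and it only checks whether any non-whitespace text can be extracted, not whether that text is of good quality.

```shell
# has_text_layer FILE
# Succeeds (exit status 0) when pdftotext can extract at least one
# non-whitespace character from FILE, i.e. the PDF carries a text layer.
has_text_layer() {
    # "-" tells pdftotext to write the extracted text to standard output
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Example use (yourfile.pdf is a placeholder):
#   if has_text_layer yourfile.pdf; then
#       pdftotext yourfile.pdf yourfile.txt
#   else
#       echo "no text layer: OCR needed"
#   fi
```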
Tesseract
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.[1]
Tesseract is a free-software OCR package.
one page prototype
Get one page from a PDF file with pdftk burst, which splits the PDF into one file per page
pdftk yourfile.pdf burst
Choose the page you want to convert
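If you already know which page you want, pdftk can also extract it directly with its cat operation instead of bursting the whole document. A small sketch follows; the function name extract_page and the file names are only illustrative.

```shell
# extract_page PDF PAGE OUT
# Copy page number PAGE of PDF into the new single-page file OUT,
# using pdftk's "cat" operation.
extract_page() {
    pdftk "$1" cat "$2" output "$3"
}

# Example use: write page 3 of yourfile.pdf to page.pdf
#   extract_page yourfile.pdf 3 page.pdf
```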
Convert PDF to bit-map using imagemagick, with some options to optimize OCR
convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff
-density 300
resolution of 300 DPI. Lower resolutions will create errors :)
-depth 8
number of bits per color channel (8 bits per channel; add -colorspace Gray if you want a grey-scale image)
-strip -background white -alpha off
strips metadata, removes the alpha channel (opacity), and makes the background white
output.tiff
in previous versions Tesseract only accepted images as TIFFs, but currently more bitmap formats are accepted
OCR
tesseract output.tiff -l eng output
Will generate the file output.txt
- -l is the option for language (English is the default)
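The two steps above (rasterize, then OCR) can be wrapped in one function. This is a sketch under assumptions: the function name ocr_page is hypothetical, the input is a single-page PDF whose name ends in .pdf, and ImageMagick is available as convert.

```shell
# ocr_page PDF [LANG]
# Rasterize a single-page PDF to TIFF with the ImageMagick options shown
# above, then run Tesseract on the result.
# Writes <name>.tiff and <name>.txt next to the PDF.
ocr_page() (
    pdf=$1
    lang=${2:-eng}      # English unless another language code is given
    base=${pdf%.pdf}    # page.pdf -> page
    convert -density 300 "$pdf" -depth 8 -strip -background white -alpha off "$base.tiff" &&
        tesseract "$base.tiff" "$base" -l "$lang"
)

# Example use:
#   ocr_page page.pdf        # English
#   ocr_page page.pdf spa    # Spanish
```

The function body runs in a subshell, so its variables do not leak into the calling shell. On ImageMagick 7 installations the command may be named magick rather than convert.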
Advanced
language
Lists all tesseract languages available in your system.
tesseract --list-langs
If a document contains more than one language, Tesseract can use several languages at once. Join the language codes with +:
tesseract output.tiff -l eng+spa output
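When the list of languages comes from a script, it helps to build the -l argument and to check that every language is installed before running OCR. The helpers below are a sketch; join_langs and missing_langs are hypothetical names, and the check simply compares against the output of tesseract --list-langs.

```shell
# join_langs CODE...
# Join language codes with "+", the form expected by the -l option.
join_langs() (
    IFS=+
    printf '%s\n' "$*"
)

# missing_langs CODE...
# Print every requested code that "tesseract --list-langs" does not report.
missing_langs() (
    # some Tesseract versions print the list on stderr, hence 2>&1
    installed=$(tesseract --list-langs 2>&1)
    for code in "$@"; do
        printf '%s\n' "$installed" | grep -qxF "$code" || printf '%s\n' "$code"
    done
)

# Example use:
#   missing_langs eng spa                       # prints nothing if both are installed
#   tesseract output.tiff -l "$(join_langs eng spa)" output
```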
multipages
TIFF files can be multi-page images. Hence, if we use the previous ImageMagick command to convert a multi-page PDF to a TIFF, the TIFF will be multi-page too, which Tesseract should handle.
$ tesseract TypewriterArt.tiff TypewriterArt
Tesseract Open Source OCR Engine v3.03 with Leptonica
Page 1 of 8
Page 2 of 8
Page 3 of 8
Another option is providing Tesseract with a text file containing the path/filename to each image in sequence:
list.txt:
p001.tiff
p002.tiff
p003.png
tesseract list.txt output
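The list file can be generated rather than typed by hand. A minimal sketch, assuming all page images sit in one directory and that their alphabetical order is the page order; the function name list_pages and the directory name scans are hypothetical.

```shell
# list_pages DIR
# Print the TIFF and PNG files found in DIR, one path per line, in glob
# (alphabetical) order, ready to be redirected into list.txt.
list_pages() {
    for f in "$1"/*; do
        case $f in
            *.tif|*.tiff|*.png) printf '%s\n' "$f" ;;
        esac
    done
}

# Example use:
#   list_pages scans > list.txt
#   tesseract list.txt output
```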
segmentation
Page Segmentation Mode (-psm) directs the layout analysis that Tesseract performs on the page.
By default, Tesseract automates the page segmentation, but does not perform orientation and script detection.
From Tesseract man page:
-psm N
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
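A mode is selected by passing its number after the option, for example mode 7 for an image that holds a single line of text. One caveat that goes beyond this page, which targets Tesseract 3.x: from Tesseract 4 onward the option is spelled --psm with two dashes. The sketch below picks the spelling from the installed version; psm_flag is a hypothetical helper name and line.tiff a placeholder file.

```shell
# psm_flag
# Print the page-segmentation option name that matches the installed
# Tesseract: "-psm" for 3.x, "--psm" for 4 and later.
psm_flag() (
    # the version banner may go to stderr, hence 2>&1;
    # its first line looks like "tesseract 3.03"
    major=$(tesseract --version 2>&1 |
        sed -n '1s/^tesseract[^0-9]*\([0-9][0-9]*\).*/\1/p')
    if [ "${major:-3}" -ge 4 ]; then
        printf '%s\n' --psm
    else
        printf '%s\n' -psm
    fi
)

# Example use: treat the image as a single text line (mode 7)
#   tesseract line.tiff output "$(psm_flag)" 7
```

If the version cannot be parsed, the helper falls back to the 3.x spelling used throughout this page.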
searchable PDF
tesseract input.tiff output -l eng pdf
hocr
Tesseract 3.0x supports an hocr option, which creates an HOCR file.
HOCR is an HTML+XML (XHTML) file consisting of recognized words and their coordinates.
The HOCR file contains each page as an element with class='ocr_page', whose title attribute contains the following fields:
- ppageno: The physical page number
- image: The relative path (from the HOCR file) to the page image
- bbox: The dimensions of the image
The OCRed text is atomized into text elements of different magnitude, such as:
- paragraph "ocr_par"
- line "ocr_line"
- word "ocrx_word"
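Because each word element carries its bounding box in the title attribute, the coordinates can be pulled out with ordinary text tools. The following is a rough sketch, not a real HTML parser: it assumes the attribute layout Tesseract typically writes (single-quoted attributes, class before title, plain text inside the word span), and hocr_words is a hypothetical name. The HOCR file itself would be produced with a command such as tesseract input.tiff output -l eng hocr.

```shell
# hocr_words FILE
# Print one line per recognized word: "x0 y0 x1 y1 word", taken from
# the ocrx_word spans of an HOCR file.
hocr_words() {
    # 1. isolate every word span; 2. keep only its bbox numbers and text
    grep -o "<span class='ocrx_word'[^>]*>[^<]*</span>" "$1" |
        sed "s/.*title='bbox \([0-9 ]*[0-9]\)[^']*'[^>]*>\([^<]*\)<\/span>/\1 \2/"
}

# Example use:
#   hocr_words output.hocr
```

Words that contain nested markup (for example bold or italic tags) are skipped by this pattern; a proper HTML/XML parser is the safer choice for anything beyond a quick inspection.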
HOCR tools
- hocrviewer-mirador
- Python-based box file editor: moshPyTT for Tesseract v.3.0
Training
Documentation on Training Tesseract [3]
Tesseract needs to know about different shapes of the same character by having different fonts separated explicitly.
box
When the input is a standard text set in a standard font, the results are not bad.
But for unusual fonts or hand-written scripts the results get worse, and Tesseract can be trained to recognize them.
Tesseract needs a 'box' file to go with each training image. The box file is a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around the image. [4]
tesseract input.tiff output -l nld makebox
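The makebox command writes a box file (output.box here) in which each line holds a symbol followed by its left, bottom, right and top coordinates and the page number. Before correcting the file by hand, it can be useful to see how many samples of each symbol it contains. A small sketch; box_counts is a hypothetical helper name.

```shell
# box_counts FILE
# Print "count symbol" for every symbol in a Tesseract box file,
# sorted by symbol, to spot characters with few training samples.
box_counts() {
    awk '{ n[$1]++ } END { for (c in n) print n[c], c }' "$1" |
        LC_ALL=C sort -k2,2
}

# Example use:
#   box_counts output.box
```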
See more Tesseract add-ons at https://github.com/tesseract-ocr/tesseract/wiki/AddOns
Artistic research
Kindle Scanner Peter Purgathofer
References
- [1] https://github.com/tesseract-ocr/tesseract/blob/master/README.md
- [2] ‘HOCR - OCR Workflow and Output Embedded in HTML’. n.d. Accessed 11 January 2018. http://kba.cloud/hocr-spec/1.2/.
- [3] https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
- [4] https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files