Revision as of 15:51, 11 January 2018

install

Tesseract (with languages you will be using)

Mac: brew install tesseract --all-languages
Debian/Ubuntu: sudo aptitude install tesseract-ocr
- See which language packages are available: sudo aptitude search tesseract-ocr-
- Install language packages: sudo aptitude install tesseract-ocr-ara tesseract-ocr-por tesseract-ocr-spa (here I am installing Arabic, Portuguese and Spanish)

poppler-utils, which includes tools such as pdftotext and pdftohtml

Mac: brew install poppler (the Homebrew formula is named poppler and ships the same utilities)
Debian/Ubuntu: sudo aptitude install poppler-utils

imagemagick

Mac: brew install imagemagick
Debian/Ubuntu: sudo aptitude install imagemagick

pdftk

Mac: brew install pdftk
Debian/Ubuntu: sudo aptitude install pdftk
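
Once everything is installed, a quick check that the commands used on this page are actually on your PATH can save some confusion later. The following is only a small sketch: the function name missing_tools is made up for this example, and it assumes ImageMagick is invoked as convert, as in the rest of this page.

```shell
# Print the name of every listed command that cannot be found on PATH.
# missing_tools is a hypothetical helper name, not part of any package.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: prints nothing when all four commands are available.
# missing_tools tesseract pdftotext pdftk convert
```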

PDF

with text layer
without text layer

To find out the difference you can try to select the PDF's text in a PDF viewer. Only if the text layer is present will you be able to select it.

If the PDF contains a text layer, you can use the pdftotext command-line application (from poppler-utils) to convert it to text, without any OCR.
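
The selection test can also be approximated from the command line. The sketch below assumes that a PDF "has a text layer" when pdftotext manages to extract at least one non-whitespace character; the function name has_text_layer and the file names are placeholders. A scanned PDF with a poor-quality text layer would still pass this test, so treat it as a first filter only.

```shell
# Succeed if pdftotext extracts at least one non-whitespace character.
# has_text_layer is a hypothetical helper name.
has_text_layer() {
    # "-" as the output name makes pdftotext write to standard output.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Usage (yourfile.pdf is a placeholder):
# if has_text_layer yourfile.pdf; then
#     pdftotext yourfile.pdf yourfile.txt
# else
#     echo "no text layer found, OCR needed"
# fi
```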

Tesseract

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.^[1]

Tesseract is a Free software OCR package

one page prototype

Get a single page from the PDF file with pdftk burst, which splits the document into one PDF per page:

pdftk yourfile.pdf burst

Choose the page you want to convert.
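
By default burst names its output files pg_0001.pdf, pg_0002.pdf and so on. If you already know which page you want, pdftk's cat operation can copy just that page into a new file instead. A minimal sketch, where extract_page is a made-up wrapper name and the file names are placeholders:

```shell
# Copy one page of a PDF into its own file using pdftk's "cat" operation.
# extract_page is a hypothetical helper name.
extract_page() {
    # $1 = input PDF, $2 = page number, $3 = output PDF
    pdftk "$1" cat "$2" output "$3"
}

# Usage: write page 3 of yourfile.pdf to page.pdf
# extract_page yourfile.pdf 3 page.pdf
```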

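Convert the PDF page to a bitmap using ImageMagick, with some options that improve the OCR results:

convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff

-density 300: renders the page at a resolution of 300 DPI. Lower resolutions will create recognition errors :)
-depth 8: number of bits per color channel. Note that this limits the bit depth but does not by itself make the image grey-scale (that would need an option such as -colorspace Gray)
-strip -background white -alpha off: -strip removes embedded profiles and comments; -background white -alpha off removes the alpha channel (opacity) and makes the background white
output.tiff: earlier versions of Tesseract only accepted TIFF images, but current versions accept more bitmap formats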

OCR

tesseract output.tiff -l eng output

This generates the file output.txt.

-l is the option for the language (English, eng, is the default)
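
The two steps of the prototype (bitmap conversion, then OCR) can be chained in a small shell function. This is only a sketch under a few assumptions: ocr_pdf_page is a made-up name, the input is a single-page PDF, and the arguments are passed to tesseract in the order given in its usage message (image, output base, then options).

```shell
# Convert a single-page PDF to TIFF and run Tesseract on the result.
# ocr_pdf_page is a hypothetical helper name.
ocr_pdf_page() {
    # $1 = single-page PDF, $2 = output base name, $3 = language (default: eng)
    lang=${3:-eng}
    convert -density 300 "$1" -depth 8 -strip -background white -alpha off "$2.tiff" &&
        tesseract "$2.tiff" "$2" -l "$lang"
}

# Usage: creates output.tiff and output.txt from page.pdf
# ocr_pdf_page page.pdf output
# ocr_pdf_page page.pdf output spa
```

The && makes sure Tesseract only runs when the conversion succeeded, so a failed convert does not leave you OCR-ing a stale TIFF from an earlier run.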

details

language

List all Tesseract languages available on your system:

tesseract --list-langs

Select more than one language:

tesseract output.tiff -l eng+spa output
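
Before asking for a language with -l, it can help to check that its data is installed. The sketch below searches the output of tesseract --list-langs for an exact language code; have_lang is a made-up name, and merging stderr into stdout is a precaution, on the assumption that some Tesseract releases print the list on stderr.

```shell
# Succeed if the given language code appears in Tesseract's language list.
# have_lang is a hypothetical helper name.
have_lang() {
    # -x matches whole lines only, -F treats the code as a literal string.
    tesseract --list-langs 2>&1 | grep -qxF "$1"
}

# Usage:
# have_lang spa || echo "Spanish data missing, install tesseract-ocr-spa"
```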

multipages

TIFF files can be multi-page images. So if we use the previous ImageMagick command to convert a multi-page PDF to a TIFF, the resulting TIFF will be multi-page too, which Tesseract should handle:

$ tesseract TypewriterArt.tiff TypewriterArt
Tesseract Open Source OCR Engine v3.03 with Leptonica
Page 1 of 8
Page 2 of 8
Page 3 of 8

Another option is to provide Tesseract with a text file (here called savedlist) containing the path/filename of each image, in sequence:

p001.tiff
p002.tiff
p003.png

tesseract savedlist output
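
Such a list does not have to be typed by hand. The sketch below writes every .tiff and .png file of a directory into a list file, sorted by name; make_image_list is a made-up name, and it assumes the page images are named so that alphabetical order equals page order (p001, p002, ...).

```shell
# Write the paths of all .tiff and .png files in a directory to a list file,
# one per line, sorted by name. make_image_list is a hypothetical helper name.
make_image_list() {
    # $1 = directory with the page images, $2 = list file to write
    for f in "$1"/*.tiff "$1"/*.png; do
        # Skip patterns that matched nothing.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done | sort > "$2"
}

# Usage:
# make_image_list pages savedlist
# tesseract savedlist output
```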

training

Box file editor: moshPyTT for Tesseract v.3.0

See more Tesseract add-ons at https://github.com/tesseract-ocr/tesseract/wiki/AddOns

Artistic research

"Reverse OCR"

References

↑ https://github.com/tesseract-ocr/tesseract/blob/master/README.md


@@ Line 1: / Line 1: @@
-* [https://code.google.com/p/tesseract-ocr/ Tesseract] is a Free software OCR package
+==install==
+Tesseract (with languages you will be using)
+* Mac <code>brew install tesseract --all-languages</code>
+* Debian/Ubuntu: <code>sudo aptitude install tesseract-ocr</code>
+** See what language packages are available with: <code>sudo aptitude search tesseract-ocr-</code>
+** install language packages: <code>sudo aptitude install tesseract-ocr-ara tesseract-ocr-port tesseract-ocr-spa</code> here I am installing Arabic, Portuguese, Spanish
+poppler-utils whic include tools such as pdftotext and pdftohtml
+* Mac <code>brew install poppler-utils</code>
+* Debian/Ubuntu: <code>sudo aptitude install poppler-utils</code>
+imagemagick
+* Mac <code>brew install imagemagick</code>
+* Debian/Ubuntu: <code>sudo aptitude install imagemagick</code>
+pdftk
+* Mac <code>brew install pdftk</code>
+* Debian/Ubuntu: <code>sudo aptitude install pdftk</code>
+=PDF=
+* with text layer
+* without text layer
+To find out the difference you can try to select the PDF's text in a PDF viewer. Only if the text layer is present will you be able to select it.
+If it contains a text layer you can use pdftotext command-line application (from poppler-utils) to convert the PDF to text
+= Tesseract=
+<blockquote>Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.<ref>https://github.com/tesseract-ocr/tesseract/blob/master/README.md</ref></blockquote>
+[https://code.google.com/p/tesseract-ocr/ Tesseract] is a Free software OCR package
+==one page prototype==
+Getting 1 page from PDF file with PDFTK <code>burst</code>
+ pdftk yourfile.pdf burst
+Chose page you want to convert
+Convert PDF to bit-map using imagemagick, with some options to optimize OCR
+ convert -density 300 page.pdf -depth 8 -strip -background white -alpha off ouput.tiff
+* <code>-density 300</code> resolution 300DPI. Lower resolutions will create errors :)
+* <code>-depth 8</code>number of bits for color. 8bit depth <nowiki>==</nowiki> grey-scale
+* <code>-strip -background white -alpha off</code> removes alpha channel (opacity), and makes the background white
+* <code>output.tiff</code>in previous versions Tesseract only accepted images as tiffs, but currently more bitmap formats are accepted
+OCR
+ tesseract output.tiff -l eng output
+Will generate the file output.txt
+* -l is the option for language (English is the default)
+==details==
+===language===
+Lists all tesseract languages available in your system.
+ tesseract --list-langs
+Select more than language
+ tesseract output.tiff -l eng+spa output
+==multipages==
+Tiff files can be multi-page images. Hence if we use the prevoious  IM command to convert a PDF to a TIFF, if the PDF is multi page, so will be it TIFF. Which Tesseract should handle.
+<source lang="bash">
+$ tesseract TypewriterArt.tiff TypewriterArt
+Tesseract Open Source OCR Engine v3.03 with Leptonica
+Page 1 of 8
+Page 2 of 8
+Page 3 of 8
+</source>
+Another option is providing Tesseract with a text file containing the path/filename to each image in sequence:
+<source lang="bash">
+p001.tiff
+p002.tiff
+p003.png
+</source>
+tesseract savedlist output
+==training==
+Box file editor: [https://code.google.com/archive/p/moshpytt/downloads moshPyTT] for Tesseract v.3.0
+See more Tesseract add-on in https://github.com/tesseract-ocr/tesseract/wiki/AddOns
 == Artistic research ==
@@ Line 6: / Line 94: @@
 [http://reverseocr.tumblr.com/ "Reverse OCR"]
+==References==
+<references/>

Optical character recognition with Tesseract: Difference between revisions