Optical character recognition with Tesseract
Andre Castro
==install==
Tesseract (with the languages you will be using)
* Mac: <code>brew install tesseract --all-languages</code> (Homebrew has since removed formula options; on current versions the language data is the separate <code>tesseract-lang</code> formula)
* Debian/Ubuntu: <code>sudo aptitude install tesseract-ocr</code>
** See which language packages are available with: <code>sudo aptitude search tesseract-ocr-</code>
** Install language packages: <code>sudo aptitude install tesseract-ocr-ara tesseract-ocr-por tesseract-ocr-spa</code> (here Arabic, Portuguese and Spanish)
poppler-utils, which includes tools such as pdftotext and pdftohtml
* Mac: <code>brew install poppler</code> (the Homebrew formula is named <code>poppler</code>)
* Debian/Ubuntu: <code>sudo aptitude install poppler-utils</code>
imagemagick
* Mac: <code>brew install imagemagick</code>
* Debian/Ubuntu: <code>sudo aptitude install imagemagick</code>
pdftk
* Mac: <code>brew install pdftk</code>
* Debian/Ubuntu: <code>sudo aptitude install pdftk</code>
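Once everything is installed, it can help to confirm that the commands are actually on the PATH. The following is a minimal sketch; the function name <code>check_tools</code> is made up for illustration and is not part of any of the packages above.

```shell
# Hypothetical check: report which of the given commands are missing
# from the PATH. Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        # command -v prints the path of a command, or fails if not found
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Possible usage: <code>check_tools tesseract pdftotext convert pdftk</code>.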
=PDF=
A PDF comes either:
* with text layer
* without text layer
To tell the two apart, try to select the PDF's text in a PDF viewer: the text can only be selected if a text layer is present.
If the PDF contains a text layer, you can use the <code>pdftotext</code> command-line application (from poppler-utils) to convert it to text. If it does not, the text has to be recognised with OCR.
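The same check can be approximated from the command line. This is a minimal sketch, assuming <code>pdftotext</code> is on the PATH; the function name <code>has_text_layer</code> is hypothetical. It treats a PDF as having a text layer if the extracted text contains at least one non-whitespace character, which is only a heuristic.

```shell
# Hypothetical helper: succeeds (exit status 0) if pdftotext extracts
# any non-whitespace text from the PDF given as first argument.
has_text_layer() {
    # "-" makes pdftotext write the text to stdout instead of a file;
    # grep -q succeeds as soon as it sees a non-whitespace character
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

Possible usage: <code>has_text_layer yourfile.pdf && pdftotext yourfile.pdf yourfile.txt</code>.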
= Tesseract=
<blockquote>Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.<ref>https://github.com/tesseract-ocr/tesseract/blob/master/README.md</ref></blockquote>
[https://code.google.com/p/tesseract-ocr/ Tesseract] is a Free software OCR package
==one page prototype==
Get a single page from the PDF file with pdftk <code>burst</code>, which splits the PDF into one file per page (named pg_0001.pdf, pg_0002.pdf, ... by default):
 pdftk yourfile.pdf burst
Choose the page you want to convert (called page.pdf below).
Convert the PDF page to a bitmap using ImageMagick, with some options to optimize OCR:
 convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff
* <code>-density 300</code>: resolution of 300 DPI. Lower resolutions tend to produce more recognition errors.
* <code>-depth 8</code>: 8 bits per color channel (for a grey-scale image, 256 levels of grey)
* <code>-strip -background white -alpha off</code>: removes metadata and the alpha channel (opacity), and makes the background white
* <code>output.tiff</code>: previous versions of Tesseract only accepted TIFF images, but currently more bitmap formats are accepted
OCR:
 tesseract output.tiff -l eng output
This generates the file output.txt.
* <code>-l</code> is the option for the language (English is the default)
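The three steps can be chained into one function. This is a sketch only: the function name <code>ocr_page</code> and the <code>pageN</code> file names are invented here, and it uses pdftk's <code>cat</code> operation to pull out a single page instead of <code>burst</code>, so that no manual choosing step is needed.

```shell
# Hypothetical wrapper around the steps above.
# Arguments: PDF file, page number, language (optional, default eng).
ocr_page() {
    # extract only the wanted page (instead of bursting the whole PDF)
    pdftk "$1" cat "$2" output "page$2.pdf" &&
    # rasterise at 300 DPI, 8 bits per channel, white background, no alpha
    convert -density 300 "page$2.pdf" -depth 8 -strip \
        -background white -alpha off "page$2.tiff" &&
    # OCR; Tesseract appends .txt to the output base name
    tesseract "page$2.tiff" "page$2" -l "${3:-eng}"
}
```

For example, <code>ocr_page yourfile.pdf 3 spa</code> should leave the recognised text of page 3 in page3.txt.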
==details==
===language===
List all Tesseract languages available on your system:
 tesseract --list-langs
Select more than one language:
 tesseract output.tiff -l eng+spa output
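Before running a multi-language job, it can be useful to check that every requested language is installed. A minimal sketch, assuming the output of <code>tesseract --list-langs</code> has one language code per line after a header line; the function name <code>langs_available</code> is hypothetical.

```shell
# Hypothetical helper: check that every language in an "eng+spa"-style
# specification appears in the output of `tesseract --list-langs`.
langs_available() {
    # some Tesseract versions print the list on stderr, hence 2>&1
    installed=$(tesseract --list-langs 2>&1)
    # split the specification on "+"
    for lang in $(printf '%s' "$1" | tr '+' ' '); do
        # -x: match the whole line, -F: fixed string, -q: no output
        printf '%s\n' "$installed" | grep -qxF -- "$lang" || {
            echo "missing language: $lang" >&2
            return 1
        }
    done
}
```

Possible usage: <code>langs_available eng+spa && tesseract output.tiff -l eng+spa output</code>.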
==multipages==
TIFF files can be multi-page images. Hence, if the previous ImageMagick command is used to convert a multi-page PDF, the resulting TIFF will also be multi-page, which Tesseract should handle:
<source lang="bash">
$ tesseract TypewriterArt.tiff TypewriterArt
Tesseract Open Source OCR Engine v3.03 with Leptonica
Page 1 of 8
Page 2 of 8
Page 3 of 8
</source>
Another option is to provide Tesseract with a text file containing the path/filename of each image, in sequence:
<source lang="bash">
p001.tiff
p002.tiff
p003.png
</source>
If that file is saved as <code>savedlist</code>, run:
 tesseract savedlist output
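The list file does not have to be written by hand. A minimal sketch, with a made-up function name <code>make_image_list</code>, that writes its arguments to a list file, one path per line:

```shell
# Hypothetical helper: write the given image paths, one per line and in
# the order given, to the list file named by the first argument.
make_image_list() {
    listfile=$1
    shift
    # printf repeats the format once for every remaining argument
    printf '%s\n' "$@" > "$listfile"
}
```

Possible usage: <code>make_image_list savedlist p*.tiff && tesseract savedlist output</code>. Shell globs expand in sorted order, so zero-padded names such as p001, p002 keep the pages in sequence.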
==training==
Box file editor: [https://code.google.com/archive/p/moshpytt/downloads moshPyTT] for Tesseract v.3.0
See more Tesseract add-ons at https://github.com/tesseract-ocr/tesseract/wiki/AddOns
== Artistic research ==
[http://reverseocr.tumblr.com/ "Reverse OCR"]
==References==
<references/>
Revision as of 15:51, 11 January 2018