Optical character recognition with Tesseract

From XPUB & Lens-Based wiki

Revision as of 16:37, 11 January 2018


install

Tesseract (with the languages you will be using)

  • Mac: brew install tesseract --all-languages
  • Debian/Ubuntu: sudo aptitude install tesseract-ocr
    • see which language packages are available with: sudo aptitude search tesseract-ocr-
    • install language packages with: sudo aptitude install tesseract-ocr-ara tesseract-ocr-por tesseract-ocr-spa (here installing Arabic, Portuguese and Spanish)

poppler-utils, which includes tools such as pdftotext and pdftohtml

  • Mac: brew install poppler (the Homebrew formula is named poppler, not poppler-utils)
  • Debian/Ubuntu: sudo aptitude install poppler-utils

imagemagick

  • Mac: brew install imagemagick
  • Debian/Ubuntu: sudo aptitude install imagemagick

pdftk

  • Mac: brew install pdftk
  • Debian/Ubuntu: sudo aptitude install pdftk


PDF

  • with text layer
  • without text layer

To tell the two apart, try to select the PDF's text in a PDF viewer. You will only be able to select it if a text layer is present.

If the PDF contains a text layer, you can use the pdftotext command-line application (from poppler-utils) to convert the PDF to text. If it does not, the pages have to go through OCR.
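
The text-layer check can also be scripted. The following is a minimal sketch, assuming pdftotext is installed and on the PATH; the function names (pdftotext_command, has_text_layer, extract_text) are made up for this illustration. It relies on pdftotext writing to standard output when the output file is given as "-".

```python
import subprocess


def pdftotext_command(pdf_path, txt_path="-"):
    """Build the pdftotext call; "-" as output sends the text to stdout."""
    return ["pdftotext", pdf_path, txt_path]


def has_text_layer(extracted_text):
    """Assumption: a PDF without a text layer yields only whitespace
    (newlines and form feeds), so any other character means text."""
    return bool(extracted_text.strip())


def extract_text(pdf_path):
    """Run pdftotext (must be installed) and return what it extracted."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, has_text_layer(extract_text("yourfile.pdf")) would tell you whether OCR is needed for that file.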


Tesseract

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.[1]

Tesseract is a free-software OCR package.


one page prototype

Split the PDF file into single pages with pdftk burst

pdftk yourfile.pdf burst 

Choose the page you want to convert (burst writes one file per page, named pg_0001.pdf, pg_0002.pdf, and so on; page.pdf below stands for the page you chose)

Convert the PDF to a bitmap using imagemagick, with some options to optimize for OCR

convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff
  • -density 300: resolution of 300 DPI. Lower resolutions will create errors :)
  • -depth 8: number of bits per color channel. This alone does not convert the image to grey-scale; add -colorspace Gray for that
  • -strip -background white -alpha off: -strip removes metadata and profiles; -background white -alpha off removes the alpha channel (opacity) and makes the background white
  • output.tiff: in previous versions Tesseract only accepted images as TIFFs, but currently more bitmap formats are accepted

OCR

tesseract output.tiff -l eng output

This will generate the file output.txt

  • -l is the option for language (English is the default)
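
The prototype above can be chained from a script. This is only a sketch, assuming pdftk, ImageMagick and Tesseract are installed; the helper names (burst_page_name, one_page_commands, run_all) are invented for this example. The commands are the same ones shown above, built as argument lists so that file names with spaces survive.

```python
import subprocess


def burst_page_name(page_number):
    """File name pdftk burst gives a page by default (pg_0001.pdf, ...)."""
    return "pg_%04d.pdf" % page_number


def one_page_commands(page_pdf, base="output", lang="eng"):
    """Return the convert and tesseract calls for one single-page PDF."""
    tiff = base + ".tiff"
    return [
        # PDF -> 300 DPI bitmap, white background, no alpha channel
        ["convert", "-density", "300", page_pdf, "-depth", "8", "-strip",
         "-background", "white", "-alpha", "off", tiff],
        # bitmap -> base.txt
        ["tesseract", tiff, "-l", lang, base],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

After running pdftk yourfile.pdf burst, a call such as run_all(one_page_commands(burst_page_name(3))) would OCR the third page into output.txt.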


details

language

List all Tesseract languages available on your system:

tesseract --list-langs

Select more than one language:

tesseract output.tiff -l eng+spa output
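
When scripting, it can help to check the requested languages against what is installed before building the -l option. A rough sketch follows; it assumes the output of tesseract --list-langs is a header line starting with "List of" followed by one language code per line (the exact header may differ between versions), and both function names are hypothetical.

```python
def parse_list_langs(output):
    """Extract language codes from the text printed by --list-langs.

    Assumption: a header line starting with "List of", then one code
    per line.
    """
    lines = [line.strip() for line in output.splitlines()]
    return [l for l in lines if l and not l.lower().startswith("list of")]


def language_option(wanted, installed):
    """Build the -l option, joining several languages with "+"."""
    missing = [lang for lang in wanted if lang not in installed]
    if missing:
        raise ValueError("languages not installed: " + ", ".join(missing))
    return ["-l", "+".join(wanted)]
```

With eng and spa installed, language_option(["eng", "spa"], installed) gives the two arguments -l and eng+spa used in the command above.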


multipages

TIFF files can be multi-page images. Hence, if we use the previous ImageMagick command to convert a multi-page PDF to a TIFF, the resulting TIFF will be multi-page as well, which Tesseract should handle.

$ tesseract TypewriterArt.tiff TypewriterArt
Tesseract Open Source OCR Engine v3.03 with Leptonica
Page 1 of 8
Page 2 of 8
Page 3 of 8

Another option is to give Tesseract a text file (here named savedlist) containing the path/filename of each image, one per line, in sequence:

p001.tiff
p002.tiff
p003.png

tesseract savedlist output
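
Such a list file can be generated rather than typed by hand. A small sketch, using the file name savedlist from above; the function names are made up for this example, and the images are written in the order given, so sort them first if the page order matters.

```python
from pathlib import Path


def image_list_text(image_paths):
    """One image path per line, in the order given."""
    return "".join(str(path) + "\n" for path in image_paths)


def write_image_list(image_paths, list_path="savedlist", base="output"):
    """Write the list file and return the matching tesseract call."""
    Path(list_path).write_text(image_list_text(image_paths), encoding="utf-8")
    return ["tesseract", str(list_path), base]
```

For instance, write_image_list(sorted(str(p) for p in Path(".").glob("p*.tiff"))) would list every matching TIFF in the current directory in name order.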


hocr

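Tesseract 3.0x supports an hocr option, which creates an hOCR file. It is passed as a config name after the output base, e.g. tesseract output.tiff output hocr

hOCR is an HTML (XHTML) format that holds the recognized words together with their coordinates on the page.

It breaks the OCRed text into text elements of different granularity, such as: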

  • paragraph "ocr_par"
  • line "ocr_line"
  • word "ocrx_word"

[2]
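
To give an idea of what can be done with those elements, here is a sketch that pulls every word and its bounding box out of an hOCR file using only the Python standard library. It assumes each "ocrx_word" element carries a title attribute of the form "bbox x0 y0 x1 y1", as described in the hOCR spec; the class and function names, and the SAMPLE fragment, are written by hand for illustration and are not real Tesseract output.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word being read, if any
        self._text = []
        self._depth = 0     # tags opened inside the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1    # e.g. <strong> or <em> inside a word
        elif "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())
                self._text = []
                self._depth = 0

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
        else:               # closing tag of the word element itself
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def hocr_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written fragment for illustration only
SAMPLE = """<p class='ocr_par'><span class='ocr_line' title='bbox 36 92 200 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 90'>Hello</span>
<span class='ocrx_word' title='bbox 109 92 200 116'><strong>world</strong></span>
</span></p>"""

for text, bbox in hocr_words(SAMPLE):
    print(text, bbox)
```

The same approach works for "ocr_line" or "ocr_par" by changing the class name that is matched.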


training

Box file editor: moshPyTT for Tesseract v.3.0

See more Tesseract add-ons at https://github.com/tesseract-ocr/tesseract/wiki/AddOns

Artistic research

"Reverse OCR"


References

  1. https://github.com/tesseract-ocr/tesseract/blob/master/README.md
  2. ‘HOCR - OCR Workflow and Output Embedded in HTML’. n.d. Accessed 11 January 2018. http://kba.cloud/hocr-spec/1.2/.