PythonLabZalan: Difference between revisions
Revision as of 16:08, 24 March 2018
Terminal
First, I looked into basic command-line functions (File:Commands terminal.pdf) and how they operate, to build a solid base for Python3.
Optical character recognition + Tesseract
Secondly, I experimented in the Terminal with converting PDF or JPG files to .txt files using Tesseract and ImageMagick (convert).
Tesseract (with the languages you will be using)
- Mac: brew install tesseract --all-languages
ImageMagick
- Mac: brew install imagemagick
How to use it?
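tesseract <image file> <base name of the txt file>
tesseract namefile.png text2
Tesseract appends .txt to the output name itself, so this command writes text2.txt (passing text2.txt would produce text2.txt.txt).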
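The same call can be scripted from Python3. The following is a minimal sketch, not a definitive wrapper: the function names `tesseract_command`, `expected_text_file` and `run_ocr` are made up for illustration, and it assumes the `tesseract` program is installed and on the PATH. It relies on Tesseract appending `.txt` to the output base name, and on the `-l` option for choosing a language.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang=None):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE [-l LANG]."""
    cmd = ["tesseract", str(image), str(output_base)]
    if lang:
        # Optional language code, e.g. "eng" (must be installed).
        cmd += ["-l", lang]
    return cmd


def expected_text_file(output_base):
    """Tesseract appends .txt to the output base name."""
    return Path(str(output_base) + ".txt")


def run_ocr(image, output_base, lang=None):
    """Run Tesseract and return the recognised text (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    return expected_text_file(output_base).read_text(encoding="utf-8")
```

For example, `run_ocr("namefile.png", "text2")` would run the command shown above and read the result back from `text2.txt`.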
Getting single pages from a PDF file with the pdftk burst operation
pdftk yourfile.pdf burst
This writes one PDF per page (pg_0001.pdf, pg_0002.pdf, and so on).
Or use ImageMagick, which writes one TIFF file per page
convert -density 300 Typewriter\ Art\ -\ Riddell\ Alan.pdf Typewriter-%03d.tiff
Choose the page you want to convert
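To find the file that belongs to a given page, the numbering of the two tools has to be taken into account. The sketch below assumes the default naming of both tools: ImageMagick fills `%03d` with an index that starts at 0, while `pdftk burst` names its files `pg_0001.pdf`, `pg_0002.pdf`, and so on, starting at 1. The function names are hypothetical.

```python
def tiff_name_for_page(page_number, pattern="Typewriter-%03d.tiff"):
    """File written by ImageMagick for a 1-based PDF page number.

    Assumes the default numbering, where the first page gets index 0.
    """
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return pattern % (page_number - 1)


def burst_name_for_page(page_number, pattern="pg_%04d.pdf"):
    """File written by 'pdftk burst' for a 1-based PDF page number.

    Assumes the default output pattern, which is numbered from 1.
    """
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return pattern % page_number


print(tiff_name_for_page(1))    # Typewriter-000.tiff
print(tiff_name_for_page(12))   # Typewriter-011.tiff
print(burst_name_for_page(12))  # pg_0012.pdf
```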
Convert the PDF to a bitmap using ImageMagick, with some options to optimize OCR
convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff
-density 300
resolution of 300 DPI. Lower resolutions will produce OCR errors :)
-depth 8
number of bits per color channel. An 8-bit depth on its own does not make the image grey-scale; that would additionally need -colorspace Gray
-strip -background white -alpha off
removes embedded metadata and the alpha channel (opacity), and makes the background white
output.tiff
in previous versions Tesseract only accepted images as TIFFs, but currently more bitmap formats are accepted
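The conversion step can also be assembled in Python3, which makes the options easy to change per file. This is a sketch under assumptions: `convert_command` and `pdf_page_to_tiff` are illustrative names, the defaults mirror the command above, and the `program` parameter exists only because newer ImageMagick 7 installations call the tool `magick` instead of `convert`.

```python
import shlex
import shutil
import subprocess


def convert_command(pdf, tiff, density=300, depth=8, program="convert"):
    """Build the ImageMagick argument list used above for OCR preparation."""
    return [
        program,
        "-density", str(density),   # render the PDF at this DPI
        str(pdf),
        "-depth", str(depth),       # bits per color channel
        "-strip",                   # drop embedded metadata
        "-background", "white",
        "-alpha", "off",            # disable the alpha channel
        str(tiff),
    ]


def pdf_page_to_tiff(pdf, tiff, **options):
    """Run the conversion (hypothetical helper; needs ImageMagick installed)."""
    cmd = convert_command(pdf, tiff, **options)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(cmd[0] + " is not installed or not on PATH")
    subprocess.run(cmd, check=True)
    return tiff


# Show the equivalent shell line without running anything.
print(shlex.join(convert_command("page.pdf", "output.tiff")))
```

Passing the arguments as a list avoids having to escape spaces in file names such as `Typewriter Art - Riddell Alan.pdf`.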