Revision as of 16:08, 24 March 2018

Terminal

First, I looked into basic command-line functions (File:Commands terminal.pdf) and how they work, to build a solid base for Python3.

Second, I experimented in Terminal with converting PDF or JPG files to .txt files using tesseract and imagemagick (convert).

Input 1

Output 1

Tesseract (with languages you will be using)

imagemagick

How to use it? tesseract - png - base name of the txt file (tesseract adds .txt itself)

tesseracttest SZAKACS$ tesseract namefile.png text2
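
The same call can be scripted from Python3. This is a minimal sketch, assuming tesseract is on the PATH and that the English language data is installed (`-l eng`); the helper names `tesseract_command` and `run_ocr` are made up for illustration.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE -l LANG."""
    # tesseract appends ".txt" to the output base by itself,
    # so drop a trailing ".txt" if the caller passed one.
    base = str(output_base)
    if base.endswith(".txt"):
        base = base[: -len(".txt")]
    return ["tesseract", str(image), base, "-l", lang]


def run_ocr(image, output_base, lang="eng"):
    """Run tesseract and return the recognized text (needs tesseract installed)."""
    cmd = tesseract_command(image, output_base, lang)
    subprocess.run(cmd, check=True)
    return Path(cmd[2] + ".txt").read_text(encoding="utf-8")
```

Calling `run_ocr("namefile.png", "text2")` would write text2.txt next to the script and return its contents.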

Splitting a PDF file into single pages with PDFTK burst

pdftk yourfile.pdf burst

Or use imagemagick

convert -density 300 Typewriter\ Art\ -\ Riddell\ Alan.pdf Typewriter-%03d.tiff

Choose the page you want to convert
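
Both tools number their output files, so the page you want maps to a predictable file name. A small sketch, assuming the default naming: pdftk burst writes pg_0001.pdf, pg_0002.pdf and so on (counting from 1), while convert fills in `%03d` counting from 0. The function names are made up for illustration.

```python
def pdftk_burst_name(page_number):
    """File written by `pdftk yourfile.pdf burst` for a 1-based page number."""
    return "pg_%04d.pdf" % page_number


def convert_page_name(page_number, pattern="Typewriter-%03d.tiff"):
    """File written by convert for a 1-based page number (convert counts from 0)."""
    return pattern % (page_number - 1)


print(pdftk_burst_name(5))   # pg_0005.pdf
print(convert_page_name(5))  # Typewriter-004.tiff
```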

Convert PDF to bit-map using imagemagick, with some options to optimize OCR

convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff

-density 300: resolution of 300 DPI. Lower resolutions tend to cause OCR errors :)
-depth 8: number of bits per color channel. 8 bits is enough for grey-scale (256 levels)
-strip -background white -alpha off: removes the alpha channel (opacity) and makes the background white
output.tiff: previous versions of Tesseract only accepted TIFF images, but it now accepts more bitmap formats
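
The two steps can be chained from Python3. This is a rough sketch, assuming both `convert` and `tesseract` are installed and on the PATH; `convert_command` and `pdf_page_to_text` are hypothetical helper names, and the defaults simply mirror the options listed above.

```python
import subprocess


def convert_command(pdf_page, output_image, density=300, depth=8):
    """Build the ImageMagick call that rasterizes one PDF page for OCR."""
    return [
        "convert",
        "-density", str(density),  # resolution in DPI, set before reading the PDF
        str(pdf_page),
        "-depth", str(depth),      # bits per color channel
        "-strip",                  # drop profiles and comments
        "-background", "white",
        "-alpha", "off",           # remove the alpha channel
        str(output_image),
    ]


def pdf_page_to_text(pdf_page, output_image="output.tiff", output_base="output"):
    """Rasterize the page, then OCR it; returns the name of the text file."""
    subprocess.run(convert_command(pdf_page, output_image), check=True)
    # tesseract appends ".txt" to the output base itself
    subprocess.run(["tesseract", str(output_image), output_base], check=True)
    return output_base + ".txt"
```

Keeping the command as a list of arguments avoids shell quoting problems with file names that contain spaces, such as the Typewriter Art PDF above.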

@@ Line 16: / Line 16: @@
 * Mac <code>brew install imagemagick</code>
+How to use it?
 <code>tesseract - png - base name of the txt file (tesseract adds .txt itself)</code>
 <code>tesseracttest SZAKACS$ tesseract namefile.png text2</code>
+Splitting a PDF file into single pages with PDFTK <code>burst</code>
+ pdftk yourfile.pdf burst
+Or use imagemagick
+ convert -density 300 Typewriter\ Art\ -\ Riddell\ Alan.pdf Typewriter-%03d.tiff
+Choose the page you want to convert
+Convert PDF to bit-map using imagemagick, with some options to optimize OCR
+ convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff
+* <code>-density 300</code> resolution of 300 DPI. Lower resolutions tend to cause OCR errors :)
+* <code>-depth 8</code> number of bits per color channel. 8 bits is enough for grey-scale (256 levels)
+* <code>-strip -background white -alpha off</code> removes the alpha channel (opacity) and makes the background white
+* <code>output.tiff</code> previous versions of Tesseract only accepted TIFF images, but it now accepts more bitmap formats
 = '''Python3'''=