Revision as of 17:41, 24 March 2018

Terminal

First I looked into basic command line functions (File:Commands terminal.pdf) and their operations, to create a solid base for Python3.

Optical character recognition + Tesseract

Second, I experimented in Terminal with how to convert PDF or JPG files to .txt files with tesseract and imagemagick (convert).

Optical character recognition

Input 1

Output 1

Tesseract (with languages you will be using)

Mac: brew install tesseract --all-languages

imagemagick

Mac: brew install imagemagick

How to use it?

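tesseract, then the input image (e.g. a png), then the base name of the txt file

tesseracttest SZAKACS$ tesseract namefile.png text2

Tesseract appends .txt to the output base name itself, so this command writes text2.txt; passing text2.txt as the name would produce text2.txt.txt.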
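The same call can be made from Python3. This is only a minimal sketch, assuming tesseract is installed and on the PATH; the function names `tesseract_command` and `run_ocr` are hypothetical helpers made up for illustration, not part of Tesseract.

```python
import subprocess


def tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE [-l LANG]."""
    command = ["tesseract", image_path, output_base]
    if lang:
        # -l selects the recognition language, e.g. "eng"
        command += ["-l", lang]
    return command


def run_ocr(image_path, output_base, lang=None):
    """Run tesseract and return the path of the text file it writes."""
    # check=True raises CalledProcessError if tesseract exits with an error
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    # tesseract adds the .txt extension to the output base name by itself
    return output_base + ".txt"
```

For example, `run_ocr("namefile.png", "text2")` would run the command shown above and return `"text2.txt"`.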

Getting single pages from a PDF file with pdftk burst

pdftk yourfile.pdf burst

Or use imagemagick

convert -density 300 Typewriter\ Art\ -\ Riddell\ Alan.pdf Typewriter-%03d.tiff

Choose the page you want to convert
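To find the file that belongs to a given page, the numbering of both tools has to be kept in mind. The sketch below assumes the default behaviour: pdftk burst names its files pg_0001.pdf, pg_0002.pdf and so on (starting at 1), while an ImageMagick pattern such as Typewriter-%03d.tiff starts counting at 0. Both helper functions are hypothetical names used only for illustration.

```python
def pdftk_page_file(page):
    """File name pdftk burst gives to a 1-based page number (default pattern)."""
    return "pg_%04d.pdf" % page


def imagemagick_page_file(pattern, page):
    """File name ImageMagick gives to a 1-based page number.

    Assumes the default scene numbering, which starts at 0,
    so page 1 of the PDF ends up in the file numbered 000.
    """
    return pattern % (page - 1)
```

With these assumptions, page 3 of the PDF would be `pg_0003.pdf` after pdftk burst and `Typewriter-002.tiff` after the convert command above.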

Convert PDF to bit-map using imagemagick, with some options to optimize OCR

convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff
-density 300: resolution of 300 DPI. Lower resolutions will create errors :)
-depth 8: number of bits per color channel (8 bits give 256 levels, which is enough for a grey-scale scan)
-strip -background white -alpha off: removes the alpha channel (opacity) and makes the background white
output.tiff: in previous versions Tesseract only accepted TIFF images, but currently more bitmap formats are accepted
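The convert call above can also be assembled in Python3, which makes it easier to change the resolution or the file names later. This is a rough sketch, assuming ImageMagick's convert is on the PATH; `convert_command` and `pdf_page_to_tiff` are hypothetical helper names.

```python
import subprocess


def convert_command(pdf_path, tiff_path, density=300, depth=8):
    """Build the ImageMagick call with the OCR-friendly options listed above."""
    return [
        "convert",
        "-density", str(density),   # render the PDF at this DPI (set before the input)
        pdf_path,
        "-depth", str(depth),       # bits per color channel
        "-strip",                   # drop profiles and comments
        "-background", "white",     # fill transparent areas with white
        "-alpha", "off",            # remove the alpha channel
        tiff_path,
    ]


def pdf_page_to_tiff(pdf_path, tiff_path):
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(convert_command(pdf_path, tiff_path), check=True)
    return tiff_path
```

Calling `pdf_page_to_tiff("page.pdf", "output.tiff")` would run the same command as the one typed in Terminal.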

Python3

Input 2

Output 2

NLTK Analysis outcome

To be able to understand how NLTK works I did an intensive Python beginners' learning week from 26.02.–04.03.2018.

Natural Language Tool Kit

For the NLTK text analysis I used one of the pages of my reader. My first NLTK analysis in Python3 (link to the script) extracts different data from the textual input (see NLTK analysis outcome), such as:

Number of words
Number of lowercase letters
Number of uppercase letters
10 most common characters
10 most common words
Words of the text longer than 15 characters
Number of verbs
Number of nouns
Number of adverbs
Number of pronouns
Number of adjectives
Number of lines
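The counts in this list could be collected roughly like this. This is not the linked script, only a minimal sketch under some assumptions: the plain counts use a simple regex tokenizer from the standard library, and the word classes are counted from the Penn Treebank tags that `nltk.pos_tag` returns. The function names are hypothetical, and `pos_stats` additionally needs NLTK with its tokenizer and tagger data installed via `nltk.download`.

```python
import re
from collections import Counter


def basic_stats(text, long_word_length=15):
    """Collect the counts that need no part-of-speech tagging (stdlib only)."""
    # Simple tokenizer: runs of letters, digits or apostrophes count as words.
    words = re.findall(r"[A-Za-z0-9']+", text)
    return {
        "words": len(words),
        "lowercase": sum(1 for ch in text if ch.islower()),
        "uppercase": sum(1 for ch in text if ch.isupper()),
        "common_chars": Counter(ch for ch in text if not ch.isspace()).most_common(10),
        "common_words": Counter(w.lower() for w in words).most_common(10),
        "long_words": sorted({w for w in words if len(w) > long_word_length}),
        "lines": len(text.splitlines()),
    }


# Penn Treebank tag prefixes for each word class (e.g. VBD, NNS, RBR, PRP$, JJR).
TAG_PREFIXES = {
    "verbs": "VB",
    "nouns": "NN",
    "adverbs": "RB",
    "pronouns": "PRP",
    "adjectives": "JJ",
}


def count_word_classes(tagged):
    """Count word classes in a list of (word, tag) pairs."""
    counts = dict.fromkeys(TAG_PREFIXES, 0)
    for _word, tag in tagged:
        for name, prefix in TAG_PREFIXES.items():
            if tag.startswith(prefix):
                counts[name] += 1
    return counts


def pos_stats(text):
    """Tag the text with NLTK and count the word classes."""
    import nltk  # imported here so basic_stats also works without NLTK

    return count_word_classes(nltk.pos_tag(nltk.word_tokenize(text)))
```

For a text file produced by tesseract, the input would simply be `open("text2.txt").read()`, passed first to `basic_stats` and then to `pos_stats`.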

DrawBot

ACCP (Analogue Circular Communication Protocol)

PythonLabZalan: Difference between revisions