Revision as of 17:38, 24 March 2018

Terminal

First I looked into basic command-line functions (File:Commands terminal.pdf) and how they work, to build a solid base for Python3.

Second, I experimented in Terminal with converting PDF or JPG files to .txt files using tesseract and imagemagick (convert).

Input 1

Output 1

Tesseract (with languages you will be using)

imagemagick

How to use it?

tesseract, then the PNG file, then the base name of the txt file (Tesseract adds the .txt extension itself)

tesseracttest SZAKACS$ tesseract namefile.png text2
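
The same call can be scripted from Python3. This is a minimal sketch, assuming tesseract is installed and on the PATH; the helper names `ocr_command` and `run_ocr` are hypothetical and only illustrate the argument order described above. Because Tesseract appends .txt to the output name itself, the helper drops a trailing .txt if one is passed in.

```python
import subprocess
from pathlib import Path


def ocr_command(image, out_base, lang=None):
    """Build the tesseract call: image in, <out_base>.txt out."""
    base = Path(out_base)
    # Tesseract adds ".txt" on its own, so avoid ending up with "name.txt.txt".
    if base.suffix == ".txt":
        base = base.with_suffix("")
    cmd = ["tesseract", str(image), str(base)]
    if lang:
        # -l selects the language, e.g. "eng"; its data files must be installed.
        cmd += ["-l", lang]
    return cmd


def run_ocr(image, out_base, lang=None):
    """Run tesseract; raises if it is not installed or exits with an error."""
    subprocess.run(ocr_command(image, out_base, lang), check=True)
```

Calling `run_ocr("namefile.png", "text2")` would then write text2.txt next to the script.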

Getting a single page from a PDF file with pdftk burst

pdftk yourfile.pdf burst

Or use imagemagick

convert -density 300 Typewriter\ Art\ -\ Riddell\ Alan.pdf Typewriter-%03d.tiff

Choose the page you want to convert
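
To find the file that belongs to a given page, it helps to know how each tool numbers its output. This is a small sketch, assuming the default naming of pdftk burst (pg_0001.pdf, pg_0002.pdf, ...) and assuming that the ImageMagick %03d counter starts at 0; both function names are hypothetical.

```python
def burst_page_name(page):
    """File written by `pdftk yourfile.pdf burst` for a 1-based page number."""
    return "pg_%04d.pdf" % page


def convert_page_name(page, pattern="Typewriter-%03d.tiff"):
    """File written by convert for a 1-based page number.

    Assumes the counter in the pattern starts at 0, so page 1 is file 000.
    """
    return pattern % (page - 1)
```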

Convert the PDF page to a bitmap using imagemagick, with some options to optimize OCR

convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff
-density 300: resolution of 300 DPI. Lower resolutions will create errors :)
-depth 8: number of bits per colour channel; 8 bits is enough for grey-scale
-strip -background white -alpha off: removes metadata and the alpha channel (opacity), and makes the background white
output.tiff: in previous versions Tesseract only accepted images as TIFFs, but currently more bitmap formats are accepted
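
The options above can be wrapped in a small Python3 helper so the resolution is easy to change between tries. This is a sketch only, assuming imagemagick's convert is on the PATH; the function names `convert_command` and `run_convert` are hypothetical.

```python
import subprocess


def convert_command(pdf_page, out_image="output.tiff", density=300, depth=8):
    """Build the convert call that turns one PDF page into an OCR-friendly bitmap."""
    return [
        "convert",
        "-density", str(density),   # render resolution in DPI, set before the input
        pdf_page,
        "-depth", str(depth),       # bits per colour channel
        "-strip",                   # drop profiles and comments
        "-background", "white",
        "-alpha", "off",            # remove transparency
        out_image,
    ]


def run_convert(pdf_page, out_image="output.tiff"):
    """Run convert; raises if it is not installed or exits with an error."""
    subprocess.run(convert_command(pdf_page, out_image), check=True)
```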

Input 2

Output 2

NLTK Analysis outcome

== Natural Language Tool Kit ==
First NLTK analysis in Python3 (link to the script), which extracts the following data from the textual input (see NLTK analysis):
*Number of words
*Number of lowercase letters
*Number of uppercase letters
*10 most common characters
*10 most common words
*Words in the text longer than 15 characters
*Number of verbs
*Number of nouns
*Number of adverbs
*Number of pronouns
*Number of adjectives
*Number of lines
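
The linked script is not reproduced here, so the following is only a sketch of how such statistics could be computed, not the original analysis. The function names `text_stats` and `pos_counts` are hypothetical. The first uses only the Python3 standard library with a simple whitespace split; the second assumes its input is a list of (word, tag) pairs with Penn Treebank tags, which is the shape that `nltk.pos_tag` returns.

```python
from collections import Counter


def text_stats(text, long_word_len=15):
    """Basic counts over the OCR text, using a plain whitespace split."""
    words = [w.strip(".,;:!?\"'()[]") for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    return {
        "words": len(words),
        "lowercase": sum(c.islower() for c in text),
        "uppercase": sum(c.isupper() for c in text),
        "common_chars": Counter(c for c in text if not c.isspace()).most_common(10),
        "common_words": Counter(w.lower() for w in words).most_common(10),
        "long_words": sorted({w for w in words if len(w) > long_word_len}),
        "lines": len(text.splitlines()),
    }


def pos_counts(tagged):
    """Count word classes from (word, Penn Treebank tag) pairs."""
    # Tag prefixes: VB* verbs, NN* nouns, RB* adverbs, PRP* pronouns, JJ* adjectives.
    groups = {"verbs": "VB", "nouns": "NN", "adverbs": "RB",
              "pronouns": "PRP", "adjectives": "JJ"}
    return {name: sum(tag.startswith(prefix) for _, tag in tagged)
            for name, prefix in groups.items()}
```

With NLTK and its tokenizer and tagger data installed, the tagged pairs could come from `nltk.pos_tag(nltk.word_tokenize(text))`.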
='''DrawBot'''=
='''ACCP (Analogue Circular Communication Protocol)'''=