PythonLabZalan: Difference between revisions

From XPUB & Lens-Based wiki

Revision as of 16:38, 24 March 2018

Terminal

First, I looked into basic command-line functions and their operations (see File:Commands terminal.pdf) to build a solid base for Python3.

Optical character recognition + Tesseract

Second, I experimented in Terminal with converting PDF or JPG files to .txt files using tesseract and imagemagick (convert).

Optical character recognition

Input 1
Output 1

Tesseract (with the languages you will be using)

  • Mac: brew install tesseract --all-languages

imagemagick

  • Mac: brew install imagemagick

How to use it?

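The command takes the input image (for example a PNG) followed by the base name of the output text file. Tesseract appends the .txt extension itself, so the name is given without it:

tesseract namefile.png text2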
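The same call can also be scripted from Python3. This is only a minimal sketch, assuming tesseract is installed and on the PATH; the helper names tesseract_command and ocr_image are made up for this illustration and are not part of Tesseract.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, out_base, lang=None):
    """Build the argument list for: tesseract IMAGE OUT_BASE [-l LANG]."""
    cmd = ["tesseract", str(image), str(out_base)]
    if lang:
        # -l selects the language data to use, e.g. "eng"
        cmd += ["-l", lang]
    return cmd


def ocr_image(image, out_base, lang=None):
    """Run tesseract and return the path of the text file it writes."""
    subprocess.run(tesseract_command(image, out_base, lang), check=True)
    # tesseract appends ".txt" to the output base name on its own
    return Path(str(out_base) + ".txt")
```

Calling ocr_image("namefile.png", "text2") should then write text2.txt next to the script, provided the image exists.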

Splitting a PDF file into single pages with pdftk burst

pdftk yourfile.pdf burst

Or use imagemagick

convert -density 300 Typewriter\ Art\ -\ Riddell\ Alan.pdf Typewriter-%03d.tiff

Choose the page you want to convert

Convert the PDF to a bitmap using imagemagick, with some options to optimize OCR:

  • convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff
  • -density 300: resolution of 300 DPI. Lower resolutions will create errors :)
  • -depth 8: number of bits per color channel (8 bits)
  • -strip -background white -alpha off: strips metadata, removes the alpha channel (opacity), and makes the background white
  • output.tiff: previous versions of Tesseract only accepted TIFF images, but currently more bitmap formats are accepted
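The convert call with the options listed above could be wrapped in Python3 roughly like this. It is a sketch under assumptions: the function names convert_command and pdf_page_to_tiff are hypothetical, and the binary is assumed to be called convert (newer ImageMagick releases also ship it as magick).

```python
import subprocess


def convert_command(pdf, tiff, density=300, depth=8):
    """Build the ImageMagick call from the list above as an argument list."""
    return [
        "convert",
        "-density", str(density),  # given before the input so the PDF is rasterised at this DPI
        str(pdf),
        "-depth", str(depth),      # bits per color channel
        "-strip",                  # drop profiles and comments
        "-background", "white",
        "-alpha", "off",           # remove the alpha channel
        str(tiff),
    ]


def pdf_page_to_tiff(pdf, tiff):
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(convert_command(pdf, tiff), check=True)
```

Passing the arguments as a list means file names with spaces, such as Typewriter Art - Riddell Alan.pdf, need no backslash escaping.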

Python3

Input 2
Output 2
NLTK Analysis outcome







Natural Language Tool Kit

My first NLTK analysis in Python3 (link to the script) extracts the following data from the text input (see NLTK analysis):

  • Number of words
  • Number of lowercase letters
  • Number of uppercase letters
  • 10 most common characters
  • 10 most common words
  • Words of the text longer than 15 characters
  • Number of verbs
  • Number of nouns
  • Number of adverbs
  • Number of pronouns
  • Number of adjectives
  • Number of lines
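The counts in this list can be sketched as follows. This is not the linked script, only an illustration under assumptions: text_stats and pos_counts are hypothetical names, the simple counts use a naive whitespace tokenisation from the standard library, and the part-of-speech counts assume NLTK is installed together with its tokenizer, tagger and universal tagset data (the exact nltk.download resource names depend on the NLTK version).

```python
import string
from collections import Counter


def text_stats(text, long_word_len=15):
    """Collect simple counts from a text using only the standard library."""
    # Naive tokenisation: split on whitespace, strip surrounding punctuation
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    return {
        "words": len(words),
        "lowercase": sum(c.islower() for c in text),
        "uppercase": sum(c.isupper() for c in text),
        "common_chars": Counter(c for c in text if not c.isspace()).most_common(10),
        "common_words": Counter(w.lower() for w in words).most_common(10),
        "long_words": [w for w in words if len(w) > long_word_len],
        "lines": len(text.splitlines()),
    }


def pos_counts(text):
    """Count verbs, nouns, adverbs, pronouns and adjectives with NLTK."""
    import nltk  # imported here so text_stats works without NLTK installed

    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens, tagset="universal")
    tags = Counter(tag for _, tag in tagged)
    return {name: tags[name] for name in ("VERB", "NOUN", "ADV", "PRON", "ADJ")}
```

Reading a file produced by the OCR step and passing its contents to text_stats gives a dictionary with one entry per simple count; pos_counts adds the five part-of-speech totals.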

DrawBot

ACCP (Analogue Circular Communication Protocol)