PythonLabZalan
Revision as of 16:41, 24 March 2018
Terminal
First, I looked into basic command line functions and their operations (File:Commands terminal.pdf) to create a solid base for Python3.
Optical character recognition + Tesseract
Second, I experimented in Terminal with converting PDF or JPG files to .txt files using tesseract and imagemagick (convert).
Tesseract (with languages you will be using)
- Mac
brew install tesseract --all-languages
imagemagick
- Mac
brew install imagemagick
How to use it?
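tesseract, then the input image (e.g. a PNG), then the base name of the output file. Tesseract adds the .txt extension itself, so text2 produces text2.txt.
tesseracttest SZAKACS$ tesseract namefile.png text2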
Splitting a PDF file into single pages with pdftk burst
pdftk yourfile.pdf burst
Or use imagemagick
convert -density 300 Typewriter\ Art\ -\ Riddell\ Alan.pdf Typewriter-%03d.tiff
Choose the page you want to convert
Convert the PDF to a bitmap using imagemagick, with some options to optimize OCR
convert -density 300 page.pdf -depth 8 -strip -background white -alpha off output.tiff
-density 300
resolution of 300 DPI. Lower resolutions will create errors :)
-depth 8
number of bits per channel. With 8 bits, a grey-scale image has 256 levels
-strip -background white -alpha off
removes the alpha channel (opacity) and makes the background white
output.tiff
in previous versions Tesseract only accepted images as tiffs, but currently more bitmap formats are accepted
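The two steps above (PDF page to bitmap, then bitmap to text) can also be assembled from Python3. This is only a minimal sketch: the function names convert_command and tesseract_command are made up for illustration, and the file names are the placeholders used above. It only builds and prints the commands. Running them would additionally need imagemagick and tesseract installed.

```python
import shlex


def convert_command(pdf_page, image_out, density=300, depth=8):
    """Build the imagemagick call from above as an argument list."""
    return [
        "convert",
        "-density", str(density),   # resolution in DPI, set before reading the PDF
        pdf_page,
        "-depth", str(depth),       # bits per channel
        "-strip",
        "-background", "white",
        "-alpha", "off",            # drop the alpha channel
        image_out,
    ]


def tesseract_command(image_in, output_base):
    """Build the tesseract call; tesseract appends .txt to output_base."""
    return ["tesseract", image_in, output_base]


if __name__ == "__main__":
    commands = [
        convert_command("page.pdf", "output.tiff"),
        tesseract_command("output.tiff", "page"),
    ]
    for cmd in commands:
        # Print the command as it would be typed in Terminal.
        print(shlex.join(cmd))
        # To really run it: subprocess.run(cmd, check=True)
```

Keeping the commands as lists (instead of one long string) avoids problems with spaces in file names such as "Typewriter Art - Riddell Alan.pdf".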
Python3
To understand how NLTK works, I did an intensive week of beginner Python learning from 26.02. to 04.03.2018.
Natural Language Tool Kit
For the NLTK text analysis I used one of the pages of my reader. I wrote a first NLTK analysis in python3 (link to the script) to get different data from the textual input (see NLTK analysis outcome), such as:
- Number of words
- Number of lowercase letters
- Number of uppercase letters
- 10 most common characters
- 10 most common words
- Words of the text longer than 15 characters
- Number of verbs
- Number of nouns
- Number of adverbs
- Number of pronouns
- Number of adjectives
- Number of lines
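The counts in this list can be sketched roughly as follows. This is not the linked script: it uses only the Python standard library, with a simple regular expression instead of NLTK's tokenizer. The names text_stats and pos_counts are hypothetical. The assumption for pos_counts is that the input is a list of (word, tag) pairs with Penn Treebank tags, which is the form nltk.pos_tag returns by default.

```python
import re
from collections import Counter

# Penn Treebank tag prefixes for the word classes in the list above.
POS_PREFIXES = {
    "verbs": "VB",
    "nouns": "NN",
    "adverbs": "RB",
    "pronouns": "PRP",
    "adjectives": "JJ",
}


def text_stats(text, long_word_length=15):
    """Count words, letters, common items, long words and lines in text."""
    words = re.findall(r"\w+", text)                    # simple tokenizer
    visible = [c for c in text if not c.isspace()]      # ignore spaces and newlines
    return {
        "words": len(words),
        "lowercase": sum(c.islower() for c in text),
        "uppercase": sum(c.isupper() for c in text),
        "common_chars": Counter(visible).most_common(10),
        "common_words": Counter(w.lower() for w in words).most_common(10),
        "long_words": sorted({w for w in words if len(w) > long_word_length}),
        "lines": len(text.splitlines()),
    }


def pos_counts(tagged):
    """Count word classes in a list of (word, tag) pairs."""
    counts = dict.fromkeys(POS_PREFIXES, 0)
    for _word, tag in tagged:
        for name, prefix in POS_PREFIXES.items():
            if tag.startswith(prefix):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = "The cat sat.\nThe Cat ran quickly."
    for key, value in text_stats(sample).items():
        print(key, value)
```

With NLTK installed and its tokenizer and tagger data downloaded, the tagged pairs could come from nltk.pos_tag(nltk.word_tokenize(text)) and be passed straight to pos_counts.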