PythonLabZalan

From XPUB & Lens-Based wiki

Terminal

Firstly I looked into basic command line functions File:Commands terminal.pdf and their operations for creating a solid base for Python3.

Optical character recognition + Tesseract

Secondarily I experimented in Terminal how to translate PDF or JPG to .txt files with tesseract and imagemagick (convert).

Optical character recognition

Input 1
Output 1

Tesseract (with languages you will be using)

  • Mac brew install tesseract --all-languages

imagemagick

  • Mac brew install imagemagick

How to use it?

tesseract - png - name of the txt file

tesseracttest SZAKACS$ tesseract namefile.png text2.txt

Getting 1 page from PDF file with PDFTK burst

pdftk yourfile.pdf burst

Or use imagemagick

convert -density 300 Typewriter\ Art\ -\ Riddell\ Alan.pdf Typewriter-%03d.tiff

Chose page you want to convert

Convert PDF to bit-map using imagemagick, with some options to optimize OCR

  • convert -density 300 page.pdf -depth 8 -strip -background white -alpha off ouput.tiff
  • -density 300 resolution 300DPI. Lower resolutions will create errors :)
  • -depth 8number of bits for color. 8bit depth == grey-scale
  • -strip -background white -alpha off removes alpha channel (opacity), and makes the background white
  • output.tiffin previous versions Tesseract only accepted images as tiffs, but currently more bitmap formats are accepted


Python3

Input 2
Output 2
NLTK Analysis outcome

To be able to understand how NLTK works I did an intensive python beginners learning week from 26.02. – 04.03.2018.

Find my tutorial script notes File:Terminal tutorials.pdf

Natural Language Tool Kit

For the NLTK text analysis I used one of pages of my reader. First NLTK Analysis in python3 (see below) to get different data from the textual input such as (see NLTK analysis outcome):

NLTK Analysis

  • Amount of words
  • The number of lowercase letters
  • The number of uppercase letters
  • 10 most common characters
  • 10 most common words
  • more than 15 character long words of the text
  • Amount of Verbs
  • Amount of Nouns
  • Amount of Adverbs
  • Amount of Pronouns
  • Amount of Adjectives
  • Amount of lines

NLTK Analysis Script

import nltk

from nltk import word_tokenize

from nltk import FreqDist

from nltk.tokenize import sent_tokenize

from sys import stdin,stdout

import re

import sys, string

#importing nltk library word_tokenize

from collections import Counter

text = open ("readertest.txt")
content = text.read()

#importing and reading the content

#print(content)

words = content.split(" ")

#the string content needs to signifier - needs to be splitted to be able to read it, it detects if a new words begins based on the " "


splitting_statistic = sorted (set (words))

# the content is splitted

#print(splitting_statistic)


wordsamount_statistic = f'{len(words)} Amount of the words'

#amount of the words

print(wordsamount_statistic)


string=(content)
count1=0
count2=0
for i in string:
      if(i.islower()):
            count1=count1+1
      elif(i.isupper()):
            count2=count2+1
print("The number of lowercase characters is:")
print(count1)
print("The number of uppercase characters is:")
print(count2)

#counts the lowercase and uppercase letters in the text


fdist = FreqDist(content)

print("10 most common characters:")
print(fdist.most_common(10))

#print out the 10 most common letters


fdist = FreqDist(words)

print("10 most common words:")
print(fdist.most_common(10))

#print out the 10 most common words


#new_list = fdist.most_common()

#print(new_list)


#for word, _ in new_list:  #_ ignores the second variable, dictionary (key, value)
    #print(' ',_)
 
#prints a list of the most common words - how to make it better in one line



def vowel_or_consonants (c):
	if not c.isalpha():
		return 'Neither'
	vowels = 'aeiou'

	if c.lower() in vowels:
		return 'Vowel'

	else:
		return 'Consonant'

#for c in (content):

	#print(c, vowel_or_consonants(c))
   

#print(sent_tokenize(content))

#splitting text into sentences


#for word in (words):
	#print(word)

#control structure, each word in a seperate line


#fdist = FreqDist(words)

#print("hapaxes:")
#print(fdist.hapaxes())

#words that occur once only, the so-called hapaxes


V = set(words)
long_words = [w for w in V if len(w) > 15]

print("printing the more than 15 character long words of the text")
print(sorted(long_words))

#printing the more than 15 character long words of the text


tokenized_content = word_tokenize(content)

#the content is tokenized (nltk library)


statistic3 = nltk.pos_tag(tokenized_content)

#each word becomes a tag if is a verb, noun, adverb, pronoun, adjective)

#print(statistic3)


verbscounter = 0

verblist = []


for word, tag in statistic3:
	if tag in {'VB','VBD','VBG','VBN','VBP','VBZ'}:
		verbscounter = verbscounter + 1
		verblist.append(word)

verb_statistic = f'{verbscounter} Verbs'

# shows the amount of verbs in the text

print(verb_statistic)

print(verblist)

#creating a list from the verb counter



#creating a dictionary from a list



nouncounter = 0

nounlist = []

for word, tag in statistic3:
	if tag in {'NNP','NNS','NN', 'NNPS'}:
		nouncounter = nouncounter + 1 
		nounlist.append(word)

nouns_statistic = f'{nouncounter} Nouns'

#shows the amount of nouns in the text

print(nouns_statistic)

print(nounlist)


verblist2 = verblist

nounlist2 = nounlist

verb_noun_dictionary = {}

for i in range (len(verblist2)):
	verb_noun_dictionary[verblist2[i]] = nounlist2 [i]

verblist_and_nounlists = zip (verblist2, nounlist2)

verb_noun_dictionary = dict(verblist_and_nounlists)

verblist_and_nounlists = dict(zip(verblist2, nounlist2))

print(verblist_and_nounlists)

print(len(verblist))

characters = [words]

#print(words)


'''from itertools import groupby

def n_letter_dictionary(string):
    result = {}
    for key, group in groupby(sorted(string.split(), key = lambda x: len(x)), lambda x: len(x)):
        result[key] = list(group)
    return result

 print(n_letter_dictionary)'''


adverbscounter = 0

adverblist = []

for word, tag in statistic3:
	if tag in {'RB','RBR','RBS','WRB'}:
		adverbscounter = adverbscounter + 1
		adverblist.append(word)


adverb_statistic = f'{adverbscounter} Adverbs'

#shows the amount of adverbs in the text

print(adverb_statistic)
print(adverblist)


pronounscounter = 0
pronounslist = []

for word, tag in statistic3:
	if tag in {'PRP','PRP$'}:
		pronounscounter = pronounscounter + 1
		pronounslist.append(word)

pronoun_statistic = f'{pronounscounter} Pronouns'

#shows the amount of pronouns in the text

print(pronoun_statistic)

print(pronounslist)



adjectivscounter = 0

adjectivslist = []

for word, tag in statistic3:
	if tag in {'JJ','JJR','JJS'}:
		adjectivscounter = adjectivscounter + 1
		adjectivslist.append(word)

adjectiv_statistic = f'{adjectivscounter} Adjectives'

#shows the amount of adjectives in the text

print(adjectiv_statistic)
print(adjectivslist)

coordinating_conjuction_counter = 0

for word, tag in statistic3:
	if tag in {'CC'}:
		coordinating_conjuction_counter = coordinating_conjuction_counter + 1

coordinating_conjuction_statistic = f'{coordinating_conjuction_counter} Coordinating conjuctions'

#shows the amount of coordinating_conjuction in the text

print(coordinating_conjuction_statistic)


cardinal_number = 0

for word, tag in statistic3:
	if tag in {'CC'}:
		cardinal_number = cardinal_number + 1

cardinal_number_statistic = f'{cardinal_number} Cardinal numbers'

#shows the amount of cardinal_number in the text

print(cardinal_number_statistic)


determiner_counter = 0

for word, tag in statistic3:
	if tag in {'D'}:
		determiner_counter = determiner_counter + 1

determiner_statistic = f'{determiner_counter} Determiners'

#shows the amount of Determiners in the text

print(determiner_statistic)


existential_there_counter = 0

for word, tag in statistic3:
	if tag in {'EX'}:
		existential_there_counter = existential_there_counter + 1

existential_there_statistic = f'{existential_there_counter} Existential there'

#shows the amount of Existential there in the text

print(existential_there_statistic)



foreing_words_counter = 0

for word, tag in statistic3:
	if tag in {'FW'}:
		foreing_words_counter = foreing_words_counter + 1

foreing_words_statistic = f'{foreing_words_counter} Foreing words'

#shows the amount of foreing words in the text

print(foreing_words_statistic)


preposition_or_subordinating_conjunctionlist = []

preposition_or_subordinating_conjunction_counter = 0

for word, tag in statistic3:
	if tag in {'IN'}:
		preposition_or_subordinating_conjunction_counter = preposition_or_subordinating_conjunction_counter + 1
		preposition_or_subordinating_conjunctionlist.append(word)
preposition_or_subordinating_conjunction_statistic = f'{preposition_or_subordinating_conjunction_counter} Preposition or subordinating conjunctions'

#shows the amount of preposition_or_subordinating_conjunction in the text

print(preposition_or_subordinating_conjunction_statistic)

print(preposition_or_subordinating_conjunctionlist)



list_item_marker_counter = 0

for word, tag in statistic3:
	if tag in {'LS'}:
		list_item_marker_counter = list_item_marker_counter + 1

list_item_marker_statistic = f'{list_item_marker_counter} List item markers'

#shows the amount of list item markers in the text

print(list_item_marker_statistic )


modals_counter = 0

for word, tag in statistic3:
	if tag in {'LS'}:
		modals_counter = modals_counter + 1

modals_statistic = f'{modals_counter} Modals'

#shows the amount of modals in the text

print(modals_statistic)


Predeterminer_counter = 0

for word, tag in statistic3:
	if tag in {'PDT'}:
		Predeterminer_counter = Predeterminer_counter  + 1

Predeterminer_statistic = f'{Predeterminer_counter } Predeterminers'

#shows the amount of Predeterminers in the text

print(Predeterminer_statistic)


Possessive_ending_counter = 0

for word, tag in statistic3:
	if tag in {'PDT'}:
		Possessive_ending_counter = Possessive_ending_counter + 1

Possessive_ending_statistic = f'{Possessive_ending_counter} Possessive endings'

#shows the amount of Possessive endings in the text

print(Possessive_ending_statistic)


particle_counter = 0

for word, tag in statistic3:
	if tag in {'RP'}:
		Particle_counter = particle_counter + 1

particle_statistic = f'{particle_counter} Particles'

#shows the amount of Particles endings in the text

print(particle_statistic)


symbol_counter = 0

for word, tag in statistic3:
	if tag in {'SYM'}:
		symbol_counter = symbol_counter + 1

symbol_statistic = f'{symbol_counter} Symbols'

#shows the amount of symbols in the text

print(symbol_statistic)


to_counter = 0

for word, tag in statistic3:
	if tag in {'TO'}:
		to_counter = to_counter + 1

to_statistic = f'{to_counter} to'

#shows the amount of to in the text

print(to_statistic)


interjection_counter = 0

for word, tag in statistic3:
	if tag in {'TO'}:
		interjection_counter = interjection_counter + 1

interjection_statistic = f'{interjection_counter} Interjections'

#shows the amount of interjections in the text

print(interjection_statistic)


Wh_determiner_counter = 0

for word, tag in statistic3:
	if tag in {'TO'}:
		Wh_determiner_counter = Wh_determiner_counter + 1

Wh_determiner_statistic = f'{Wh_determiner_counter} Wh determiners'

#shows the amount of Wh determiners in the text

print(Wh_determiner_statistic)


Wh_pronoun_counter = 0

for word, tag in statistic3:
	if tag in {'TO'}:
		Wh_pronoun_counter = Wh_pronoun_counter + 1

Wh_pronoun_statistic = f'{Wh_pronoun_counter} Wh pronouns'

#shows the amount of Wh pronouns in the text

print(Wh_pronoun_statistic)


Possessive_wh_pronoun_counter = 0

for word, tag in statistic3:
	if tag in {'TO'}:
		Possessive_wh_pronoun_counter  = Possessive_wh_pronoun_counter  + 1

Possessive_wh_pronoun_statistic = f'{Possessive_wh_pronoun_counter} Possessive wh pronouns'

#shows the amount of Possessive wh pronouns in the text

print(Possessive_wh_pronoun_statistic)

dic1 =([len (i) for i in verblist])
print(dic1)

dic2=([len (i) for i in nounlist])
print(dic2)

dic3=([len (i) for i in adjectivslist])
print(dic3)

dic4=([len (i) for i in preposition_or_subordinating_conjunctionlist])
print(dic4)
#print([len (i) for i in verblist_and_nounlists])
#print([len (i) for i in words])



double_numbers1 = []
for n in dic1:
	double_numbers1.append(n*100)
print(double_numbers1)

double_numbers2 = []
for n in dic2:
	double_numbers2.append(n*100)
print(double_numbers2)

double_numbers3 = []
for n in dic3:
	double_numbers3.append(n*100)
print(double_numbers3)

double_numbers4 = []
for n in dic4:
	double_numbers4.append(n*100)
print(double_numbers4)

div_numbers1= []
for n in dic1:
	div_numbers1.append(n/100)
print(div_numbers1)

div_numbers2= []
for n in dic2:
	div_numbers2.append(n/100)
print(div_numbers2)

div_numbers3= []
for n in dic3:
	div_numbers3.append(n/100)
print(div_numbers3)

div_numbers4= []
for n in dic4:
	div_numbers4.append(n/100)
print(div_numbers4)


'''lst1 = [[double_numbers1], [double_numbers2], [double_numbers3], [double_numbers4]]
print((zip(*lst1))[0])'''

'''lst1 = [[double_numbers1], [double_numbers2], [double_numbers3], [double_numbers4]]
lst2 = []
lst2.append([x[0]for x in lst1])
print(lst2 [0])'''

'''lst1 = [[double_numbers1], [double_numbers2], [double_numbers3], [double_numbers4]]
outputlist = []
for values in lst1:
	outputlist.append(values[-1])
print(outputlist)'''


n1 = double_numbers1
n1_a = (n1[0])
print(n1_a)

n2 = double_numbers2
#print(n2[0])

n3 = double_numbers3
#print(n3[0])

n4 = double_numbers4
#print(n4[0])

n5 = double_numbers1
#print(n5[1])

n6 = double_numbers2
#print(n6[1])

n7 = double_numbers3
#print(n7[1])

n8 = double_numbers3
#print(n8[1])

print((n1[0], n2[0]), (n3[0], n4[0]), (n5[1], n6[1]), (n7[1], n8[1]))

n1a = div_numbers1
#print(n1a[0])

n2a = div_numbers2
#print(n2a[0])

n3a = div_numbers3
#print(n3a[0])

n4a = div_numbers4
#print(n4a[0])

print(n1a[0], n2a[0], n3a[0], n4a[0])

text_file = open ("Output.txt", "w")

text_file.write(n1_a)
text_file.close()




wordsnumber_statistic = len(content.split()) 

#number of words

#print(wordsnumber_statistic)


numberoflines_statistic = len(content.splitlines()) 

#number of lines

print("Number of lines:")
print(numberoflines_statistic)


numberofcharacters_statistic = len(content) 

#number of characters

print("Number of characters:")
print(numberofcharacters_statistic)


d ={}

for word in words:
	d[word] = d.get(word, 0) + 1 

#how many times a word accuers in the text, not sorted yet(next step)

#print(d)


word_freq =[]

for key, value in d.items():
	word_freq.append((value, key))

#sorted the word count - converting a dictionary into a list

#print(word_freq)


lettercounter = Counter(content)

#counts the letters in the text

#print(lettercounter)


Another Data Analysis Script with taking out the stop words

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import FreqDist

import re, datetime

text = open ("output.txt")
content = text.read()

#print(content)

words = content.split(" ")

splitting_statistic = sorted (set (words))

#print(splitting_statistic)

example_sent = (words)

stop_words = set(stopwords.words ('english'))

#word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in example_sent if not w in stop_words]

filtered_sentence = []

for w in example_sent:
	if w not in stop_words:
		filtered_sentence.append(w)

#print(example_sent)
#print(filtered_sentence)

fdist = FreqDist(words)
#print(fdist.most_common(100))


mylist = (words) #init the list

print('Your input file has year dates =' )

for l in mylist:
	match = re.match(r'.*([1-3][0-9]{3})', l)
	if match is not None:
		#then it found a match!
	
		print(match.group(1))




s = open('output.txt','r').read()  # Open the input file

# Program will count the characters in text file
num_chars = len(s)

# Program will count the lines in the text file
num_lines = s.count('\n')

# Program will call split with no arguments
words = s.split()
d = {}
for w in words:
    if w in d:
        d[w] += 1
    else:
        d[w] = 1

num_words = sum(d[w] for w in d)

lst = [(d[w],w) for w in d]
lst.sort()
lst.reverse()

# Program assumes user has downloaded an imported stopwords from NLTK
from nltk.corpus import stopwords # Import the stop word list
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english')) # creating a set makes the searching faster
print ([word for word in lst if word not in stop_words])

# Program will print the results
print('Your input file has characters = '+str(num_chars))
print('Your input file has lines = '+str(num_lines))
print('Your input file has the following words = '+str(num_words))

print('\n The 100 most frequent words are /n')

i = 1
for count, word in lst[:100]:
    print('%2s. %4s %s' %(i,count,word))
    i+= 1

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
example_sent = (content)
 
stop_words = set(stopwords.words('english'))
 
word_tokens = word_tokenize(example_sent)
 
filtered_sentence = [w for w in word_tokens if not w in stop_words]
 
filtered_sentence = []
 
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
 
#print(word_tokens)
#print(filtered_sentence)

fdist = FreqDist(filtered_sentence)
#print(fdist.most_common(100))

import nltk 
with open('output.txt', 'r') as f:
    sample = f.read()


sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
print(entity_names)

# Print unique entity names
#print (set(entity_names))

dic1 =([len (i) for i in entity_names])
print(dic1)

double_numbers1 = []
for n in dic1:
	double_numbers1.append(n*100)
print(double_numbers1)

div_numbers1= []
for n in dic1:
	div_numbers1.append(n/100)
print(div_numbers1)

list(zip(*(iter([double_numbers1]),)*3))

#group = lambda t, n: zip(*[t[i::n] for i in range(n)])
#group([1, 2, 3, 4], 2)

#print(group)

input = [double_numbers1]

[input[i:i+n] for i in range(0, len(input), n)]

DrawBot

DrawBot experiment 1
DrawBot experiment 2
DrawBot experiment 3
DrawBot experiment generated through random Data input from Python3

To be able to generate geometric shapes based on the analysis of the textual content I needed to connect Python3 to the drawing software DrawBot, which works on python script.

Rotative Shape Gif in DrawBot

Geometry generated based on random data input from Python

Based on the https://vimeo.com/149247423 from Jost van Rossum


CANVAS = 500
SQUARESIZE = 158
NSQUARES = 50
SQUAREDIST = 6

width = NSQUARES * SQUAREDIST

NFRAMES = 50

for frame in range(NFRAMES):
    newPage(CANVAS, CANVAS)
    frameDuration(1/20)
    
    fill(0, 0, 1, 1)
    rect(0, 0, CANVAS, CANVAS)

    phase = 2 * pi * frame / NFRAMES  # angle in radians
    startAngle = 90 * sin(phase)
    endAngle = 90 * sin(phase + 20 *pi)

    translate(CANVAS/2 - width / 2, CANVAS/2)

    fill(1, 0, 0.5, 0.1)
    

    for i in range(NSQUARES + 1):
        f = i / NSQUARES
        save()
        translate(i * SQUAREDIST, 0)
        scale(0.7, 1)
        rotate(startAngle + f * (endAngle - startAngle))
        rect(-SQUARESIZE/2, -SQUARESIZE/2, SQUARESIZE, SQUARESIZE)
        restore()
        
#saveImage("StackOfSquares7.gif")
import json
from random import randint, random

data=[]

for i in range(100):
	x=randint(0, 1000)
	y=randint(0, 1000)
	w=randint(0, 1000)
	h=randint(0, 1000)
	r=random()
	g=random()
	b=random()
	a=random()
	data.append( (x,y,w,h,r,g,b,a) )

print (json.dumps(data, indent=2))

Random data import with json in DrawBot

import json
data = json.load(open("rdata.json"))
# print (data)

for x,y,w,h r, g, b, a in data:
    print(x,y)
    fill(r, g, b, max(a, 05))
    rect(x, y, w, h)