2009 206: Difference between revisions
Latest revision as of 15:46, 3 March 2009
Toward a navigable text
Acquiring
Today we are working with the text of 10 poems by Edgar Allan Poe, from Project Gutenberg (http://www.gutenberg.org/etext/10031).
Processing: Word counting
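import sys, re

# Count how often each word occurs in the text read from standard input.
wc = {}
for line in sys.stdin:
    line = line.rstrip()
    # Split on runs of non-letters.
    words = re.split("[^a-zA-Z]+", line)
    for word in words:
        word = word.lower()
        if word:
            wc[word] = wc.get(word, 0) + 1

# Print the words in alphabetical order with their counts.
for word in sorted(wc):
    print(word, wc[word])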
Now we make a function that takes a file and turns it into a "word count dictionary". Then we can use this function on different poems.
import sys, re

def countwords(file):
    """Return a dictionary mapping each lowercased word in file to its count."""
    wc = {}
    for line in file:
        line = line.rstrip()
        words = re.split("[^a-zA-Z]+", line)
        for word in words:
            word = word.lower()
            if word:
                wc[word] = wc.get(word, 0) + 1
    return wc

def dump(wc):
    """Print every word with its count, in alphabetical order."""
    for word in sorted(wc):
        print(word, wc[word])

def describe(wc):
    """Summarise wc: how many distinct words, and the first and last alphabetically."""
    allwords = sorted(wc)
    n = len(allwords)
    if n == 0:
        return "0 words"
    return str(n) + " words" + " from " + allwords[0] + " to " + allwords[-1]

for filename in sys.argv[1:]:
    print(filename)
    with open(filename) as f:
        wc = countwords(f)
    print(" " + describe(wc))
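To try the function without downloading the poems, any iterable of lines can stand in for an open file. The sketch below repeats countwords in Python 3 syntax and feeds it an io.StringIO holding the first two lines of "The Raven". The sample text and the variable name sample are illustrative assumptions, not part of the original session.

```python
import io
import re

def countwords(file):
    """Return a dictionary mapping each lowercased word to its count."""
    wc = {}
    for line in file:
        # Split on runs of non-letters and drop the empty strings that result.
        for word in re.split("[^a-zA-Z]+", line.rstrip()):
            word = word.lower()
            if word:
                wc[word] = wc.get(word, 0) + 1
    return wc

# Hypothetical sample standing in for one of the poem files.
sample = io.StringIO(
    "Once upon a midnight dreary, while I pondered, weak and weary,\n"
    "Over many a quaint and curious volume of forgotten lore\n"
)
wc = countwords(sample)
print(wc["a"], wc["and"])
```

Because countwords only iterates over its argument, the same call works on sys.stdin, on an open file, or on a plain list of strings.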
Word subtraction
def subtract(wc1, wc2):
    """
    Return a new dictionary with the contents of wc1, minus all the words appearing in wc2.
    """
    ...
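One possible way to fill in the body is sketched below. It reads "minus all the words appearing in wc2" as dropping every word that occurs in wc2 at all, whatever its count; subtracting the counts instead would be a different reading. The dictionaries poem and common are made-up data for illustration.

```python
def subtract(wc1, wc2):
    """Return a new dictionary with the entries of wc1 whose words are not in wc2."""
    # Build a fresh dictionary so that wc1 itself is left unchanged.
    return {word: count for word, count in wc1.items() if word not in wc2}

# Hypothetical word counts, not taken from the actual poems.
poem = {"the": 12, "raven": 3, "nevermore": 5}
common = {"the": 100, "a": 80}
print(subtract(poem, common))
```

Used this way, the word counts of one poem (or of a list of very common words) can act as a filter that leaves only the vocabulary particular to another poem.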
Visualising
def wordcloud(wc):
    """
    Outputs HTML (maybe <span> tag wrapped words) with words scaled to reflect their number of occurrences.
    """
    ...
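A minimal sketch of one way such a function might work follows. It returns the HTML as a string rather than printing it, wraps each word in a span tag, and maps counts linearly onto a pixel font size. The parameters min_size and max_size, the pixel units, and the alphabetical ordering are all assumptions made for this example.

```python
import html

def wordcloud(wc, min_size=10, max_size=40):
    """Return HTML with each word in a <span> whose font size grows with its count."""
    if not wc:
        return ""
    lo, hi = min(wc.values()), max(wc.values())
    spread = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    spans = []
    for word in sorted(wc):
        # Linear scaling: the rarest word gets min_size, the most frequent max_size.
        size = min_size + (wc[word] - lo) * (max_size - min_size) // spread
        spans.append('<span style="font-size: %dpx">%s</span>'
                     % (size, html.escape(word)))
    return "\n".join(spans)

# Hypothetical counts for illustration.
print(wordcloud({"raven": 1, "nevermore": 5, "door": 3}))
```

Linear scaling is the simplest choice; with real text a few very frequent words tend to dominate, so scaling by the logarithm of the count could be worth trying.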