2009 206: Difference between revisions

From XPUB & Lens-Based wiki
 
Latest revision as of 15:46, 3 March 2009

Toward a navigable text

Acquiring

Today we are working with the text of 10 poems by Edgar Allan Poe, from Project Gutenberg (http://www.gutenberg.org/etext/10031).

Processing: Word counting

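<source lang="python">
import sys, re

wc = {}  # maps each lowercased word to its number of occurrences

for line in sys.stdin:
    line = line.rstrip()
    # split on runs of non-letters; "+" (not "*") never matches the empty string
    words = re.split("[^a-zA-Z]+", line)
    for word in words:
        word = word.lower()
        if word:  # skip the empty strings left at the edges of the line
            wc[word] = wc.get(word, 0) + 1

# print the words alphabetically, each with its count
for word in sorted(wc):
    print(word, wc[word])
</source>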

Now we make a function that takes a file and turns it into a "word count dictionary". Then we can use this function on different poems.

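<source lang="python">
import sys, re

def countwords(file):
    """Return a dictionary mapping each lowercased word in file to its count."""
    wc = {}
    for line in file:
        line = line.rstrip()
        # split on runs of non-letters; "+" (not "*") never matches the empty string
        words = re.split("[^a-zA-Z]+", line)
        for word in words:
            word = word.lower()
            if word:
                wc[word] = wc.get(word, 0) + 1
    return wc

def dump(wc):
    """Print every word with its count, in alphabetical order."""
    for word in sorted(wc):
        print(word, wc[word])

def describe(wc):
    """Summarise a word count: how many distinct words, and the first and last alphabetically."""
    allwords = sorted(wc)
    n = len(allwords)
    if n == 0:
        return "0 words"
    return str(n) + " words" + " from " + allwords[0] + " to " + allwords[-1]

for filename in sys.argv[1:]:
    print(filename)
    with open(filename) as f:
        wc = countwords(f)
    print("   " + describe(wc))
</source>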
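To try the function without any poem files at hand, it can be fed an in-memory text. The following is a minimal sketch: the two sample lines are made up for illustration (they are not taken from the poems), and `io.StringIO` simply stands in for `open(filename)`, since `countwords` only needs something it can loop over line by line.

```python
import io
import re

def countwords(file):
    """Return a dictionary mapping each lowercased word in file to its count."""
    wc = {}
    for line in file:
        for word in re.split("[^a-zA-Z]+", line.rstrip()):
            word = word.lower()
            if word:
                wc[word] = wc.get(word, 0) + 1
    return wc

# Made-up sample text; StringIO behaves like an open file here.
sample = io.StringIO("The bells, bells, bells\nOf the bells\n")
wc = countwords(sample)

# Show the counts alphabetically.
for word in sorted(wc):
    print(word, wc[word])
```

Because the function returns a plain dictionary, the result for one poem can be kept and compared with the result for another.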

Word subtraction

<source lang="python">
def subtract(wc1, wc2):
    """
    return a new dictionary with the contents of wc1, minus all the words appearing in wc2
    """
    ...
</source>
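One possible way to fill in `subtract` is sketched here. It reads "minus all the words appearing in wc2" as "drop those words entirely", rather than subtracting their counts; that reading is an assumption. The two small dictionaries are invented for the example.

```python
def subtract(wc1, wc2):
    """Return a new dictionary with the contents of wc1, minus all the words appearing in wc2."""
    # Keep a word from wc1 only if wc2 does not contain it; wc1 itself is left unchanged.
    return {word: count for word, count in wc1.items() if word not in wc2}

# Invented counts, for illustration only.
poem = {"the": 12, "raven": 3, "nevermore": 5}
common = {"the": 100, "and": 80}

remaining = subtract(poem, common)
print(remaining)
```

Subtracting the word count of one poem from another in this way leaves the words that only the first poem uses.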

Visualising

<source lang="python">
def wordcloud(wc):
    """
    outputs HTML (maybe <span> tag wrapped words) with words scaled to reflect their number of occurrences
    """
    ...
</source>
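A rough sketch of what `wordcloud` could look like follows. Everything beyond the docstring's idea is an assumption: this version returns the HTML as a string instead of printing it, scales the font size linearly with the count, and takes two hypothetical parameters, `min_size` and `max_size`, for the pixel range.

```python
import html

def wordcloud(wc, min_size=10, max_size=48):
    """Return HTML with one <span> per word, its font size scaled to the word's count."""
    if not wc:
        return ""
    lo, hi = min(wc.values()), max(wc.values())
    spans = []
    for word in sorted(wc):
        if hi == lo:
            # every word occurs equally often: use a single size
            size = max_size
        else:
            # linear scaling: the rarest word gets min_size, the most frequent max_size
            size = min_size + (wc[word] - lo) * (max_size - min_size) / (hi - lo)
        spans.append('<span style="font-size: %dpx">%s</span>'
                     % (round(size), html.escape(word)))
    return " ".join(spans)

# Invented counts, for illustration only.
cloud = wordcloud({"raven": 1, "nevermore": 5, "door": 3})
print(cloud)
```

Other scalings (for example logarithmic) may give a more readable cloud when a few words are far more frequent than the rest.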