Syllabus 2010 t2 p3

Problem Set 2.3

1: Word Freq Clouds

Using dictionaries is a way to convert a list of words into a "word cloud", where the size of each word reflects the frequency of that word in a given text.

In class we worked with the following code:

import codecs, re

text = codecs.open("yourtextfile.txt", "r").read()
text = text.lower() # squash uppercase!
words = re.findall(r"\b[a-z]+\b", text, re.I)

# print words

count = {}
for w in words:
    if w not in count:
        count[w] = 1
    else:
        count[w] += 1

from pprint import pprint
pprint(count)
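
One way to get from counts to a cloud is to scale each count into a font size. The following is only a sketch: the name font_size, the 10 to 48 pixel range, the HTML span output and the small example dictionary are all assumptions made for illustration, not part of the class code. It is written so that it runs under Python 2 or 3.

```python
# Hypothetical example counts; in practice use the count dictionary built above.
count = {"the": 2, "man": 1, "sees": 1, "cat": 1}

MIN_SIZE = 10  # assumed smallest font size, in pixels
MAX_SIZE = 48  # assumed largest font size, in pixels

def font_size(n, lowest, highest):
    """Scale a count n linearly into the range MIN_SIZE..MAX_SIZE."""
    if highest == lowest:
        return MIN_SIZE  # every word is equally frequent
    fraction = float(n - lowest) / (highest - lowest)
    return int(round(MIN_SIZE + fraction * (MAX_SIZE - MIN_SIZE)))

lowest = min(count.values())
highest = max(count.values())

# Build one <span> per word, in alphabetical order
spans = []
for w in sorted(count):
    size = font_size(count[w], lowest, highest)
    spans.append('<span style="font-size: %dpx">%s</span>' % (size, w))

html = " ".join(spans)
print(html)
```

A linear scale is the simplest choice; with real texts a few very frequent words tend to dominate, so you may want to experiment with other scalings.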


Think about how you can order words in your tag cloud. You cannot directly sort a dictionary, but you can sort its keys...

keys = count.keys()
keys.sort()
print keys


Or try sorting the items of a dictionary, which takes advantage of Python's way of sorting tuples (element by element, so here by word first, then by count)...

items = count.items()
items.sort()
print items


Or make an inverted list of items (tuples of the count, then the word), so that sorting orders the words by count:

items = []
for w, n in count.items():
    items.append((n, w)) # note the double parentheses -- the outer are a function call, the inner are a tuple
items.sort() # the parentheses are needed to actually call sort
print items
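
As an alternative sketch to the inverted list above, the built-in sorted function accepts a key function, so the (word, count) items can be ordered by count without rebuilding them. The example dictionary and the name by_frequency are made up for illustration; the code runs under Python 2 or 3.

```python
# Hypothetical example counts
count = {"the": 2, "man": 1, "sees": 1, "cat": 1}

# The key turns each (word, n) item into (-n, word), so the largest
# counts come first and ties are broken alphabetically by word.
by_frequency = sorted(count.items(), key=lambda item: (-item[1], item[0]))

for word, n in by_frequency:
    print("%s %d" % (word, n))
```

This again relies on how Python compares tuples: the first elements are compared, and the second elements only matter when the first are equal.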


Try making a "reverse" dictionary that has the count values (1, 2, 3, ...) as keys and lists of words as values. Your code could use the pattern:

count2word = {}
for w, n in count.items():
    # YOUR CODE HERE...
    # YOUR CODE HERE...
    # YOUR CODE HERE...
    # YOUR CODE HERE...


2: Graph-ic Headlines

A graph, in programming terms, is a data structure of linked things. In mathematical terms, a graph is a set of vertices (the "things") connected by edges (the linking lines). Graph structures, and the techniques for working with and analyzing them (graph theory), have become quite important to all things net-related, as the web can be thought of as a graph (with pages as the vertices and hyperlinks as the edges). Google search is perhaps the best-known product of graph algorithms.

Create (via a dynamic script) a graph of the words of an RSS feed, with edges (links) representing which words follow each other. (This is similar to the graphs we looked at when we talked about NLTK.)

Check out PythonGraphviz to get an idea of how to draw graphs with pygraphviz. Basically you just need to tell graphviz all the "edges" of the graph, or all the possible pairs of words that appear next to each other. Based on that, the graphviz program creates a drawing of the graph.
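
To make the idea of "telling graphviz the edges" concrete, here is a sketch that skips pygraphviz and writes the edges out by hand in graphviz's DOT text format. The edge list, the graph name words and the variable names are assumptions for illustration only; the code runs under Python 2 or 3.

```python
# Hypothetical list of edges: pairs of words that appear next to each other.
edges = [("the", "man"), ("man", "sees"), ("sees", "the"), ("the", "cat")]

# A directed graph in DOT is a list of "a" -> "b"; statements in braces
lines = ["digraph words {"]
for a, b in edges:
    lines.append('    "%s" -> "%s";' % (a, b))
lines.append("}")

dot_source = "\n".join(lines)
print(dot_source)
```

If you save the printed text in a file such as words.dot, the graphviz command line program can draw it, for example with: dot -Tpng words.dot -o words.png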

To do this: start by creating a dictionary with words for keys, and whose values are the list of words that can follow the "key" word.

For example for the text:

the man sees the cat


The resulting "next words" dictionary might be:

{ 'the': ['man', 'cat'], 'man': ['sees'], 'sees': ['the'], 'cat': [] }
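
Going the other way, from a "next words" dictionary to the list of edges that graphviz needs, takes two nested loops. This is a sketch only: the dictionary is typed in by hand for the example text (with an entry for every word), and the names edges and following are made up. It runs under Python 2 or 3.

```python
# "Next words" dictionary for the text: the man sees the cat
nextwords = {"the": ["man", "cat"], "man": ["sees"], "sees": ["the"], "cat": []}

edges = []
for word in sorted(nextwords):          # sorted only to get a stable order
    for following in nextwords[word]:   # one edge per possible next word
        edges.append((word, following))

print(edges)
```

Note that 'cat' contributes no edges, because nothing follows it in the text.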


It may be helpful to convert your original list of words to a list of pairs of words. In the following code, the enumerate function is used to loop over a list with a numeric index, so that you can also refer to the next word inside the loop. All enumerate does is "zip" a list with a counter that goes from 0 up to the number of elements minus one:

pairs = []
for (i, w) in enumerate(words):
    if i+1<len(words):
        nextword = words[i+1]
        pairs.append((w, nextword))

print pairs
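
To see what enumerate actually produces, and how the same pairs can be made with the built-in zip function, try this short sketch. The word list is the example text typed in by hand; the code runs under Python 2 or 3.

```python
words = ["the", "man", "sees", "the", "cat"]

# enumerate yields (index, element) tuples, counting from 0
print(list(enumerate(words)))

# zip pairs each word with the one after it; it stops when the shorter
# list (words[1:]) runs out, so the last word starts no pair
pairs = list(zip(words, words[1:]))
print(pairs)
```

Both approaches give the same list of pairs; the enumerate loop is more explicit, while zip is shorter.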


In addition to making diagrams, a graph structure can also be used to generate new texts based on an existing one. For instance, given a "next words" dictionary (named nextwords) like that above, try the following code:

import random, time

curword = random.choice(nextwords.keys())
while True:
    print curword
    if nextwords[curword]:
        curword = random.choice(nextwords[curword]) # pick one of the words that can follow curword
    else:
        print
        curword = random.choice(nextwords.keys())
    time.sleep(0.5)
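
The loop above runs forever. For experimenting, a bounded variant that returns a fixed number of words can be handy. This is a sketch under assumptions: the function name generate, its length and seed parameters, and the hand-typed dictionary are all made up for illustration. It runs under Python 2 or 3.

```python
import random

# "Next words" dictionary for the text: the man sees the cat
nextwords = {"the": ["man", "cat"], "man": ["sees"], "sees": ["the"], "cat": []}

def generate(nextwords, length, seed=None):
    """Return a list of `length` words from a random walk over nextwords."""
    rng = random.Random(seed)          # a seed makes the walk repeatable
    starts = sorted(nextwords.keys())  # a list, so rng.choice can index it
    curword = rng.choice(starts)
    output = []
    while len(output) < length:
        output.append(curword)
        if nextwords[curword]:
            curword = rng.choice(nextwords[curword])
        else:
            curword = rng.choice(starts)  # dead end: start again anywhere
    return output

result = generate(nextwords, 10, seed=1)
print(" ".join(result))
```

Passing the same seed gives the same text each time; leave the seed out to get a different walk on every run.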