Syllabus 20100216

Today we used the nltk module (the Natural Language Toolkit) to explore different ways of representing text in Python.

Installing NLTK

1. Download the "tarball" (the .tar.gz file listed under Linux) from the download page

2. Unpack it and run the Python setup program:

tar xvzf nltk-2.0b8.tar.gz
cd nltk-2.0b8
sudo python setup.py install

3. You will probably get an error that some packages are missing (such as yaml). Install the missing package with apt-get, then repeat the setup.py step:

sudo apt-get install python-yaml

4. Download the sample materials from within Python:

>>> import nltk
>>> nltk.download()

Doing stuff with nltk

import nltk
from nltk.book import *

text1
text1[0]
text1[0:25]
text1[-25]
len(text1)
set(text1)
len(set(text1))
sorted(set(text1))
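
To see what these expressions return without loading the whole corpus, here is a small sketch on an ordinary Python list. The tokens list is a made-up stand-in for text1 (it is not taken from the book corpus); for indexing, slicing and len, an nltk.Text behaves much like a list of word strings. The single-argument print calls are written so the snippet runs on Python 2 and 3.

```python
# A made-up stand-in for text1: a plain list of word tokens.
tokens = ["Call", "me", "Ishmael", ".", "Call", "me", "again", "."]

first = tokens[0]            # first token
head = tokens[0:3]           # first three tokens (a slice is a list)
from_end = tokens[-3]        # third token counting from the end
total = len(tokens)          # number of tokens, repeats included
vocab = sorted(set(tokens))  # distinct tokens, in sorted order

print(first)
print(head)
print(from_end)
print(total)
print(len(vocab))
print(vocab)
```

Note that sorted() compares strings by character code, so punctuation and capitalized words come before lowercase ones.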

Words that end with "ly":

words = sorted(set(text1))

for w in words:
    if w.endswith("ly"):
        print w

Or you can use Python's handy list comprehension:

[w for w in words if w.endswith("ly")]
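
As a quick check that the loop and the list comprehension select the same words, here is a minimal sketch on a short made-up word list (the sample words are an assumption for illustration, not from text1). The loop collects its matches in a list instead of printing them, so the two results can be compared.

```python
# Made-up sample words; the duplicate "quickly" is removed by set().
words = sorted(set(["quickly", "Early", "fly", "whale", "only", "quickly", "sea"]))

# Loop version: collect matches in a list instead of printing them.
matches_loop = []
for w in words:
    if w.endswith("ly"):
        matches_loop.append(w)

# Comprehension version: the same filter as a single expression.
matches_comp = [w for w in words if w.endswith("ly")]

print(matches_comp)
print(matches_loop == matches_comp)
```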

Word Graph

This uses pygraphviz, so install it first:

sudo apt-get install python-pygraphviz

wordgraph.py

import nltk
import codecs, sys

inpath = sys.argv[1]
raw = codecs.open(inpath, "r", "utf-8").read()

tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
bigrams = nltk.bigrams(tokens)
cfd = nltk.ConditionalFreqDist(bigrams)
#for word in cfd.conditions():
#    print word, cfd[word].keys()

from pygraphviz import AGraph

g = AGraph(directed=True)
for word in cfd.conditions():
    # one edge per bigram, labelled with how often it occurs
    for following in cfd[word].keys():
        g.add_edge(word, following, label=str(cfd[word][following]))
g.draw('wordgraph.png', prog="dot") # draw to png using the dot layout

To run:

python wordgraph.py austen.txt
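
To make the idea behind the script easier to follow, here is a rough sketch of the same steps using only the standard library. It is a stand-in, not the NLTK or pygraphviz implementation: a nested dictionary plays the role of nltk.ConditionalFreqDist, and the graph is written out as Graphviz DOT text rather than drawn. The function names bigram_counts and to_dot and the sample sentence are made up for this example.

```python
from collections import defaultdict

def bigram_counts(tokens):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(lambda: defaultdict(int))
    for word, following in zip(tokens, tokens[1:]):
        counts[word][following] += 1
    return counts

def to_dot(counts):
    """Render the counts as Graphviz DOT text, one labelled edge per bigram."""
    lines = ["digraph wordgraph {"]
    for word in sorted(counts):
        for following in sorted(counts[word]):
            lines.append('  "%s" -> "%s" [label="%d"];'
                         % (word, following, counts[word][following]))
    lines.append("}")
    return "\n".join(lines)

# Made-up sample text, split on whitespace instead of nltk.word_tokenize.
tokens = "the cat sat on the mat and the cat slept".split()
counts = bigram_counts(tokens)
print(to_dot(counts))
```

The sketch does not escape double quotes inside tokens, so real text would need that handled before the output is passed to Graphviz.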

Infinite Texts

infinite_gen.py

import nltk
import sys, random, time

raw = sys.stdin.read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
bigrams = nltk.bigrams(tokens)
cfd = nltk.ConditionalFreqDist(bigrams)

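word = random.choice(tokens)
while True:
    print word
    time.sleep(0.5)
    followers = list(cfd[word].keys())
    # the last token of the text may have no follower; restart at random
    if followers:
        word = random.choice(followers)
    else:
        word = random.choice(tokens)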

To run:

python infinite_gen.py < austen.txt
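
For experimenting without NLTK or an endless loop, here is a minimal sketch of the same random walk using only the standard library. It is an illustration under assumptions, not the script above: the helper names successors and generate, the fixed step count, the seeded random.Random and the sample sentence are all made up for this example. Like the script, it picks uniformly among the distinct words that follow the current word, ignoring how often each one occurs.

```python
import random
from collections import defaultdict

def successors(tokens):
    """Map each word to the list of distinct words seen right after it."""
    table = defaultdict(list)
    for word, following in zip(tokens, tokens[1:]):
        if following not in table[word]:
            table[word].append(following)
    return table

def generate(tokens, steps, rng):
    """Walk the successor table for a fixed number of steps."""
    table = successors(tokens)
    word = rng.choice(tokens)
    out = []
    for _ in range(steps):
        out.append(word)
        choices = table.get(word)
        # A word seen only at the very end has no successor: restart at random.
        word = rng.choice(choices) if choices else rng.choice(tokens)
    return out

# Made-up sample text; a seeded generator makes the walk repeatable.
tokens = "the cat sat on the mat and the cat slept".split()
words = generate(tokens, 12, random.Random(0))
print(" ".join(words))
```

Every consecutive pair in the output is a bigram from the input, except where the walk hits a word with no successor and jumps to a random token.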