Syllabus 20100216
Today we used the nltk module (the Natural Language Toolkit) to explore different ways of representing text in Python.
Installing NLTK
1. Download the "tarball" (the .tar.gz file listed under Linux) from the download page
2. Unpack it and run the Python setup program:
tar xvzf nltk-2.0b8.tar.gz
cd nltk-2.0b8
sudo python setup.py install
3. You will probably get an error that some packages are missing (such as yaml).
Install the missing package with apt-get, then repeat the setup.py step:
sudo apt-get install python-yaml
4. Download the sample materials from the Python interpreter:
>>> import nltk
>>> nltk.download()
Doing stuff with nltk
import nltk
from nltk.book import *
text1
text1[0]
text1[0:25]
text1[-25]
len(text1)
set(text1)
len(set(text1))
sorted(set(text1))
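The same calls work on any list of word tokens, so they can be tried without the NLTK sample data. A minimal sketch, assuming a small hand-made token list in place of text1; the names `tokens` and `lexical_diversity` are made up for this illustration and are not part of nltk:

```python
# A hand-made token list standing in for text1 (an assumption for illustration).
tokens = "the cat sat on the mat and the dog sat on the cat".split()

def lexical_diversity(words):
    """Share of distinct words among all words (hypothetical helper)."""
    return float(len(set(words))) / len(words)

print(tokens[0])             # first token
print(tokens[0:3])           # slice: the first three tokens
print(tokens[-1])            # a negative index counts from the end
print(len(tokens))           # total number of tokens
print(len(set(tokens)))      # number of distinct tokens
print(sorted(set(tokens)))   # the vocabulary in sorted order
print(lexical_diversity(tokens))
```

Comparing len(tokens) with len(set(tokens)) gives a rough idea of how repetitive a text is.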
Words that end with "ly":
words = sorted(set(text1))
for w in words:
    if w.endswith("ly"):
        print(w)
Or you can use Python's handy list comprehension:
[w for w in words if w.endswith("ly")]
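To see that the loop and the list comprehension select the same words, both can be run on a small list. This is only a sketch: the sample vocabulary below is invented, and the names `ly_loop` and `ly_comp` are hypothetical.

```python
# Hypothetical sample vocabulary; any list of strings works.
words = sorted(set(["quickly", "whale", "sadly", "ship", "only", "sea", "sadly"]))

# Explicit loop: collect the matches one at a time.
ly_loop = []
for w in words:
    if w.endswith("ly"):
        ly_loop.append(w)

# List comprehension: the same filter written as one expression.
ly_comp = [w for w in words if w.endswith("ly")]

print(ly_loop)
print(ly_comp)
```

The comprehension returns a new list instead of printing, which makes it easy to pass the result on to len() or sorted().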
Word Graph
The script uses the pygraphviz module, so install it first:
sudo apt-get install python-pygraphviz
wordgraph.py
import nltk
import codecs, sys
from pygraphviz import AGraph
# Read the UTF-8 text file named on the command line.
inpath = sys.argv[1]
raw = codecs.open(inpath, "r", "utf-8").read()
tokens = nltk.word_tokenize(raw)
# For each word, count which words follow it and how often.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))
#for word in cfd.conditions():
#    print word, cfd[word].keys()
# One labelled edge per (word, following word) pair.
g = AGraph(directed=True)
for word in cfd.conditions():
    for follower in cfd[word].keys():
        g.add_edge(word, follower, label=str(cfd[word][follower]))
g.draw('wordgraph.png', prog="dot")  # draw to PNG using the dot layout
To run:
python wordgraph.py austen.txt
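The graph is built from bigram counts: for each word, how often each other word follows it. A rough sketch of that table in plain Python may help make the structure clear. It assumes a toy sentence and a made-up helper `bigram_counts`; it imitates the idea behind the conditional frequency distribution and is not nltk's own implementation.

```python
from collections import defaultdict

def bigram_counts(tokens):
    """Map each word to {following word: count} (hypothetical helper)."""
    counts = defaultdict(lambda: defaultdict(int))
    # zip pairs every token with the token right after it.
    for first, second in zip(tokens, tokens[1:]):
        counts[first][second] += 1
    return counts

tokens = "the cat sat on the mat the cat ran".split()
counts = bigram_counts(tokens)

# Each (word, follower, count) triple corresponds to one labelled edge.
edges = sorted(
    (word, follower, n)
    for word, followers in counts.items()
    for follower, n in followers.items()
)
for edge in edges:
    print(edge)
```

Every triple printed here would become one arrow in the drawing, with the count as its label.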
Infinite Texts
infinite_gen.py
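import nltk
import sys, random, time
# Read the source text from standard input.
raw = sys.stdin.read()
tokens = nltk.word_tokenize(raw)
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))
# Start anywhere, then keep stepping to a word that followed the current one.
word = random.choice(tokens)
while True:
    print(word)
    time.sleep(0.5)
    followers = list(cfd[word].keys())
    # A word seen only at the very end of the text has no followers; start over.
    word = random.choice(followers or tokens)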
To run:
python infinite_gen.py < austen.txt
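The generator is a random walk over the bigram table: print a word, then jump to one of the words that followed it somewhere in the text. The sketch below is a finite version in plain Python that can be run without nltk. The helper names `followers` and `generate`, the `seed` parameter, and the sample sentence are all assumptions made for this illustration.

```python
import random
from collections import defaultdict

def followers(tokens):
    """Map each word to the list of distinct words seen right after it."""
    table = defaultdict(list)
    for first, second in zip(tokens, tokens[1:]):
        if second not in table[first]:
            table[first].append(second)
    return table

def generate(tokens, length, seed=None):
    """Return `length` words from a random walk over the follower table."""
    rng = random.Random(seed)  # a fixed seed makes the walk repeatable
    table = followers(tokens)
    word = rng.choice(tokens)
    out = []
    for _ in range(length):
        out.append(word)
        # A word seen only at the very end of the text has no followers;
        # jump to a random token instead of choosing from an empty list.
        choices = table.get(word)
        word = rng.choice(choices) if choices else rng.choice(tokens)
    return out

tokens = "it is a truth universally acknowledged that it is a fact".split()
print(" ".join(generate(tokens, 8, seed=1)))
```

As in the script, each distinct follower is equally likely, so a word that follows often is no more probable than one that follows once. Weighting by the counts would be a natural variation.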