User:Laurier Rochon/prototyping/nlp
NLP with NLTK/Python
- Count number of words (word tokens) : len(text)
- Count number of distinct words (word types) : len(set(text))
- The diversity of a text (average number of times each distinct word is used) can be found with : len(text) / float(len(set(text))). The float() avoids integer division in Python 2.
- Dispersion plot : shows where certain words occur across the length of a text (useful for quick overviews), e.g. text.dispersion_plot(['of','the'])
- Collocations : pairs of words that occur together unusually often (e.g. red wine) : text.collocations()
- Join/split to convert between strings and lists using a delimiter : ' '.join(words) builds a string from a list, sentence.split(' ') builds a list from a string
- All the words starting with 'b' in text5, sorted and unique : sorted([w for w in set(text5) if w.startswith('b')])
- Find the position of the first occurrence of a word : text.index('word')
- Find all 4-letter words in a text :
V = set(text8)
fourletter = [w for w in V if len(w)==4]
sorted(fourletter)
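The counting and filtering idioms above do not depend on the NLTK corpora; they work on any list of tokens. A minimal sketch, assuming a hand-made token list (the `tokens` variable is a stand-in for `text`, not one of the NLTK book texts):

```python
# Hypothetical token list standing in for an NLTK text such as text8.
tokens = ["the", "wine", "was", "red", "and", "the", "wine", "was", "good"]

num_tokens = len(tokens)        # word tokens
num_types = len(set(tokens))    # word types (distinct words)

# Average number of times each distinct word is used.
# float() keeps Python 2 from truncating the division.
diversity = float(num_tokens) / num_types

# Sorted, unique four-letter words.
fourletter = sorted(w for w in set(tokens) if len(w) == 4)

print(num_tokens)
print(num_types)
print(diversity)
print(fourletter)
```

Swapping `tokens` for a real NLTK text should give the same kind of result, only with larger numbers.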
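- And show them in order of decreasing frequency
from nltk import FreqDist
fdist = FreqDist(text5)
# most_common() lists (word, count) pairs, most frequent first;
# in NLTK 3, keys() is no longer sorted by frequency
for w, count in fdist.most_common():
    if len(w) == 4:
        print(w)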
- Find all words containing 'ma' in them, sorted.
res = sorted([w for w in set(text) if 'ma' in w])
- How often a given word occurs in a text, expressed as a percentage
fdist = FreqDist(text)
100.0 * fdist['word'] / len(text)
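In NLTK 3, FreqDist is a subclass of collections.Counter, so the frequency idioms above can be tried with the standard library alone. A small sketch under that assumption, with a made-up token list in place of a real text:

```python
from collections import Counter

# Hypothetical token list standing in for an NLTK text.
tokens = ["the", "cat", "sat", "on", "the", "mat", "with", "the", "hat", "on"]

fdist = Counter(tokens)

# Share of the text taken up by one word, as a percentage.
# The 100.0 forces floating-point division, also under Python 2.
percent_the = 100.0 * fdist["the"] / len(tokens)

# Four-letter words with their counts, most frequent first.
four = [(w, n) for w, n in fdist.most_common() if len(w) == 4]

print(percent_the)
print(four)
```

Without the float, Python 2 would truncate the ratio to 0 for any word that does not make up the whole text.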
- Find occurrences of a word, in context : text.concordance("term")
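To see what a concordance view does, here is a rough pure-Python sketch. The `concordance` function below is a hypothetical helper written for illustration, not NLTK's implementation, and it measures context in tokens rather than characters:

```python
def concordance(tokens, term, width=2):
    """Return each occurrence of term with `width` tokens of context per side."""
    term = term.lower()
    lines = []
    for i, w in enumerate(tokens):
        if w.lower() == term:
            left = tokens[max(0, i - width):i]      # tokens before the match
            right = tokens[i + 1:i + 1 + width]     # tokens after the match
            lines.append(" ".join(left + [w] + right))
    return lines

# Made-up sentence standing in for a real text.
tokens = "I like red wine but she likes white wine more".split()
for line in concordance(tokens, "wine"):
    print(line)
```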
Gutenberg stuff
- To access raw text : gutenberg.raw('blake-poems.txt'). This returns the whole file as a single string, so len(gutenberg.raw('blake-poems.txt')) counts characters, including spaces, rather than words. macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt') splits the text into sentences. We can also use the words() method to break things into words : emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
- Find certain words in a text, and how many times they appear
from nltk.corpus import brown
import nltk
news_text = brown.words(categories="news")
fdist = nltk.FreqDist([w.lower() for w in news_text])
modals = ['what','where','who','why']
for m in modals:
    print("%s : %d" % (m, fdist[m]))
result :
what : 95
where : 59
who : 268
why : 14
- Plot a graph to show the use of words starting with 'america' and 'citizen' in each inaugural address, by year
import nltk
from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()
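A ConditionalFreqDist is built from (condition, sample) pairs: one frequency distribution per condition. A minimal sketch of the same counting with the standard library only, assuming made-up data in place of the inaugural corpus (the `speeches` dictionary and its file names are hypothetical):

```python
from collections import Counter, defaultdict

# Hypothetical (fileid -> words) data standing in for the inaugural corpus.
speeches = {
    "1789-Washington.txt": ["Fellow", "Citizens", "of", "America"],
    "1861-Lincoln.txt": ["citizens", "American", "Americans", "union"],
}
targets = ["america", "citizen"]

# condition -> Counter of samples, the same shape as a ConditionalFreqDist.
cfd = defaultdict(Counter)
for fileid, words in speeches.items():
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                cfd[target][fileid[:4]] += 1   # first four characters = year

for target in targets:
    print(target, sorted(cfd[target].items()))
```

Here the condition is the target word and the sample is the year, which is what cfd.plot() draws as one line per condition.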
- Creating pairings with different lists
l1 = ["a word","another word","and another again"]
l2 = ["cat 1", "cat 2"]
# pair every element of l2 with every element of l1 (no NLTK needed)
somevalues = [
    (v1, v2)
    for v1 in l2
    for v2 in l1
]
print(somevalues)
will output
[('cat 1', 'a word'), ('cat 1', 'another word'), ('cat 1', 'and another again'),
('cat 2', 'a word'), ('cat 2', 'another word'), ('cat 2', 'and another again')]
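The nested comprehension above is a Cartesian product of the two lists. As an alternative sketch (not part of the original notes), itertools.product from the standard library yields the same pairs in the same order:

```python
from itertools import product

l1 = ["a word", "another word", "and another again"]
l2 = ["cat 1", "cat 2"]

# Nested comprehension, as in the notes: the outer loop (l2) varies slowest.
pairs_comprehension = [(v1, v2) for v1 in l2 for v2 in l1]

# itertools.product yields the same pairs in the same order.
pairs_product = list(product(l2, l1))

print(pairs_comprehension == pairs_product)
```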