User:Laurier Rochon/prototyping/npl: Difference between revisions

From XPUB & Lens-Based wiki
== NLP with NLTK/Python ==


*Count the number of words (word tokens): len(text)
*Count the number of distinct words (word types): len(set(text))
*A rough measure of a text's diversity is the average number of times each distinct word is used: len(text) / len(set(text)) (under Python 2, use float(len(text)) to avoid integer division)
*Dispersion plot: shows where certain words occur across the length of the text, which is useful for a quick overview (e.g. text.dispersion_plot(['of','the']))
*Collocations: pairs of words that occur together unusually often (e.g. red wine): text.collocations()
*Join/split: ' '.join(words) builds a string from a list, and s.split(' ') builds a list from a string, using the given delimiter
*All the words in text5 starting with 'b', unique and sorted: sorted([w for w in set(text5) if w.startswith('b')])
*Find the position of the first occurrence of a word: text.index('word')
*Find all 4-letter words in a text :
<source lang='python'>
# text8 is one of the sample texts loaded by: from nltk.book import *
vocab = set(text8)
fourletter = [w for w in vocab if len(w) == 4]
sorted(fourletter)
</source>
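The list operations above do not depend on the NLTK sample texts, since an NLTK text can be treated much like a list of tokens for these purposes. A minimal sketch on a made-up sentence, using only the standard library (the sentence and the variable names are illustrative assumptions, not taken from the NLTK book):

```python
# Illustrative sample text (an assumption, not one of the NLTK book texts).
raw = "the cat sat on the mat and the dog sat by the door"

tokens = raw.split(" ")          # split a string into a list of tokens
rejoined = " ".join(tokens)      # join the tokens back into one string

n_tokens = len(tokens)           # word tokens
n_types = len(set(tokens))       # word types (distinct words)
avg_uses = n_tokens / n_types    # average uses per distinct word (Python 3 division)

first_sat = tokens.index("sat")  # position of the first occurrence of 'sat'

# Unique four-letter words, sorted alphabetically.
fourletter = sorted(w for w in set(tokens) if len(w) == 4)

print(n_tokens, n_types, avg_uses, first_sat, fourletter)
```

The same expressions should carry over to text5 or text8 once the NLTK book data is loaded.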
*And show them in order of decreasing frequency (here for text5):
<source lang='python'>
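from nltk import FreqDist
from nltk.book import text5

fdist = FreqDist(text5)
# most_common() lists (word, count) pairs from most to least frequent;
# in recent NLTK versions fdist.keys() is no longer sorted by frequency
for w, count in fdist.most_common():
    if len(w) == 4:
        print(w)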
</source>
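Frequency ordering can also be tried without the NLTK data. In NLTK 3, FreqDist is built on collections.Counter, so the sketch below uses Counter as a stand-in. The token list is made up for illustration, and this is only one possible way to do it:

```python
from collections import Counter

# Made-up tokens for illustration (not from the NLTK book).
tokens = "with this love came hope and with hope came this love with love".split()

counts = Counter(tokens)  # stand-in for nltk.FreqDist(tokens)

# most_common() yields (word, count) pairs, most frequent first.
fourletter_by_freq = [(w, n) for w, n in counts.most_common() if len(w) == 4]

for word, n in fourletter_by_freq:
    print(word, n)
```

Words with equal counts may appear in any relative order, so only the counts themselves should be relied on.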
*Find all words containing 'ma', sorted:
<source lang='python'>
res = sorted([w for w in set(text) if 'ma' in w])
</source>
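Note that the test 'ma' in w is case-sensitive, so it would miss a word such as 'Mary'. One possible variant, shown here on a made-up word list (an assumption for illustration), lowercases each word before testing:

```python
# Made-up tokens for illustration.
tokens = ["Mary", "had", "a", "small", "lamb", "named", "Max", "human", "Mary"]

# Case-sensitive: only matches a lowercase 'ma'.
exact = sorted(w for w in set(tokens) if "ma" in w)

# Case-insensitive: lowercase each word before the substring test.
folded = sorted(w for w in set(tokens) if "ma" in w.lower())

print(exact)
print(folded)
```

Python's default sort puts capitalised words before lowercase ones; passing key=str.lower to sorted() would give a purely alphabetical order instead.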
*How often a given word occurs in a text, expressed as a percentage
<source lang='python'>
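from nltk import FreqDist
fdist = FreqDist(text)
# multiply by 100.0 for a percentage (and to force float division in Python 2)
100.0 * fdist['word'] / len(text)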
</source>
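The percentage calculation can be wrapped in a small helper. The sketch below is a hypothetical function (word_percentage is not part of NLTK) that uses collections.Counter in place of FreqDist so that it runs without the NLTK data:

```python
from collections import Counter

def word_percentage(tokens, word):
    """Return how often `word` occurs in `tokens`, as a percentage."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty text
    return 100.0 * Counter(tokens)[word] / len(tokens)

# Made-up tokens for illustration.
tokens = "to be or not to be".split()
print(word_percentage(tokens, "be"))
```

A word that never occurs simply counts as zero, so the helper returns 0.0 rather than raising an error.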

Latest revision as of 19:49, 18 November 2010