User:Laurier Rochon/prototyping/npl: Difference between revisions

Revision as of 16:23, 18 November 2010

Count number of words (word tokens) : len(text)
Count number of distinct words (word types) : len(set(text))
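Lexical diversity of a text, measured as the average number of times each distinct word is used : len(text) / len(set(text)) (in Python 2, add from __future__ import division first, otherwise the result is truncated to an integer)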
Dispersion plot : shows where certain words occur across the text, by word offset (useful for quick overviews), e.g. text.dispersion_plot(['of','the'])
Collocations : pairs of words that occur together unusually often (e.g. red wine) : text.collocations()
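Join/split to convert between lists and strings using a delimiter : ' '.join(words) builds a string from a list, sentence.split(' ') builds a list from a string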
All the words starting with 'b' in text5, unique and sorted : sorted([w for w in set(text5) if w.startswith('b')])
Find the index of the first occurrence of a word : text.index('word')
Find all 4-letter words in a text :

V = set(text8)
fourletter = [w for w in V if len(w)==4]
sorted(fourletter)
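
The operations listed so far can be tried on a toy token list, without loading NLTK. This is only a sketch: the sample sentence and all variable names are made up for illustration, and an nltk.book text such as text5 should behave like a list of tokens for these calls.

```python
# Toy stand-in for an nltk.book text: a plain list of word tokens.
# The sentence is a hypothetical sample, not taken from any NLTK corpus.
sentence = "the red wine and the blue boat beat the best band"
tokens = sentence.split(" ")        # split: string -> list of tokens

num_tokens = len(tokens)            # number of word tokens
num_types = len(set(tokens))        # number of distinct words (word types)
avg_uses = num_tokens / num_types   # average uses per distinct word

first_wine = tokens.index("wine")   # index of the first occurrence

# Unique, sorted words starting with 'b'
b_words = sorted(w for w in set(tokens) if w.startswith("b"))

# Unique, sorted 4-letter words
four_letter = sorted(w for w in set(tokens) if len(w) == 4)

rejoined = " ".join(tokens)         # join: list of tokens -> string

print(num_tokens, num_types, first_wine)
print(b_words)
print(four_letter)
```

Using set() before sorted() is what removes the duplicates; sorted() alone would keep every repeated token.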

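Show the 4-letter words of a text in decreasing order of frequency :

from nltk.book import *  # provides text5 and FreqDist
fdist = FreqDist(text5)
# sort explicitly: only NLTK 2 returns keys() ordered by decreasing frequency
vocab = sorted(fdist, key=fdist.get, reverse=True)
for w in vocab:
	if len(w)==4:
		print(w)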
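
The frequency ordering above can also be sketched without NLTK, using collections.Counter as a stand-in for FreqDist (in NLTK 3, FreqDist is a Counter subclass, so most_common() should be available there too). The token list below is a made-up sample, not the actual content of text5.

```python
from collections import Counter

# Hypothetical toy token list standing in for text5.
tokens = "lol that chat room was lol then that chat went lol".split()

counts = Counter(tokens)  # stand-in for FreqDist(tokens)

# most_common() yields (word, count) pairs in decreasing order of count;
# keep only the 4-letter words.
four_letter_by_freq = [(w, n) for w, n in counts.most_common() if len(w) == 4]

for word, n in four_letter_by_freq:
    print(word, n)
```

Filtering after most_common() keeps the frequency order intact, so no second sort is needed.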

@@ Line 9: / Line 9: @@
 *All the words starting with 'b' in text5, unique and sorted : sorted([w for w in set(text5) if w.startswith('b')])
 *Find the index of the first occurrence of a word : text.index('word')
+*Find all 4-letter words in a text :
+<source lang='python'>
+V = set(text8)
+fourletter = [w for w in V if len(w)==4]
+sorted(fourletter)
+</source>
+*And show their distribution in order
+<source lang='python'>
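+from nltk.book import *  # provides text5 and FreqDist
+fdist = FreqDist(text5)
+# sort explicitly: only NLTK 2 returns keys() ordered by decreasing frequency
+vocab = sorted(fdist, key=fdist.get, reverse=True)
+for w in vocab:
+	if len(w)==4:
+		print(w)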
+</source>