User:Laurier Rochon/prototyping/npl: Difference between revisions
Revision as of 16:23, 18 November 2010
NLP with NLTK/Python
- Count number of words (word tokens) : len(text)
- Count number of distinct words (word types) : len(set(text))
- The diversity of a text (average number of uses per distinct word) can be found with : len(text) / len(set(text)). Under Python 2 this is integer division, so use float(len(text)) / len(set(text))
- Dispersion plot : shows where certain words occur across the length of a text (useful for quick overviews), e.g. text.dispersion_plot(['of','the'])
- Collocations : pairs of words that occur together unusually often (e.g. red wine) : text.collocations()
- Join/split to convert between strings and lists using a delimiter : ' '.join(words) and sentence.split(' ')
- All the words starting with 'b' in text5, unique and sorted : sorted([w for w in set(text5) if w.startswith('b')])
- Find the position of the first occurrence of a word : text.index('word')
- Find all 4-letter words in a text :
V = set(text8)
fourletter = [w for w in V if len(w)==4]
sorted(fourletter)
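The one-liners above can be tried without loading any NLTK corpus, since an NLTK text behaves much like a list of tokens for these operations. The following is a minimal sketch in Python 3 under that assumption; the sample sentence and all variable names are made up for illustration.

```python
# Hypothetical sample sentence; any list of string tokens behaves the same way.
sentence = "the big brown bear bit the bold blue bird by the barn"
text = sentence.split(" ")            # split: string -> list of tokens

num_tokens = len(text)                # word tokens
num_types = len(set(text))            # distinct words (word types)
diversity = num_tokens / num_types    # average uses per distinct word

# Unique words starting with 'b', sorted alphabetically
b_words = sorted(w for w in set(text) if w.startswith("b"))

# Position of the first occurrence of a word
first_bear = text.index("bear")

# Unique 4-letter words, sorted alphabetically
four_letter = sorted(w for w in set(text) if len(w) == 4)

# join: list of tokens -> string
rejoined = " ".join(text)

print(num_tokens, num_types, diversity)
print(b_words)
print(first_bear, four_letter)
```

Replacing `text` with one of the texts from `nltk.book` (for example text5) should give the same kind of results on a real corpus.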
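- And show their distribution in order of decreasing frequency
from nltk import FreqDist
fdist = FreqDist(text5)
# most_common() yields (word, count) pairs, most frequent first (NLTK 3)
for w, count in fdist.most_common():
    if len(w) == 4:
        print(w)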
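To see the frequency-ordered listing without NLTK installed, the same idea can be sketched with `collections.Counter` from the standard library, used here as a stand-in for `FreqDist` (which is a `Counter` subclass in NLTK 3). The token list is invented for the example.

```python
from collections import Counter

# Hypothetical tokens; in practice this would be an NLTK text such as text5.
tokens = "that wind will blow the ship past that cape and that wind is cold".split()

fdist = Counter(tokens)  # stand-in for nltk.FreqDist(tokens)

# Keep only 4-letter words, preserving most-frequent-first order
four_letter_by_freq = [(w, n) for w, n in fdist.most_common() if len(w) == 4]

for w, n in four_letter_by_freq:
    print(w, n)
```

Words with equal counts come out in the order they were first seen, so only the relative order of words with different counts is meaningful.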