User:Laurier Rochon/prototyping/npl: Difference between revisions

@@ Line 1: / Line 1: @@
-== NLP with NLTK/Python ==
-*Count the number of words (word tokens): len(text)
-*Count the number of distinct words (word types): len(set(text))
-*Average number of times each distinct word is used (higher values mean a less diverse text): float(len(text)) / len(set(text))
-*Dispersion plot: shows where given words occur across the text (useful for quick overviews), e.g. text.dispersion_plot(['of','the'])
-*Collocations: pairs of words that occur together unusually often (e.g. "red wine"): text.collocations()
-*Use join/split to convert between strings and lists around a delimiter
-*All the words starting with 'b' in text5, unique and sorted: sorted([w for w in set(text5) if w.startswith('b')])
-*Find the position of the first occurrence of a word: text.index('word')
-*Find all 4-letter words in a text :
-<source lang='python'>
-V = set(text8)
-fourletter = [w for w in V if len(w)==4]
-sorted(fourletter)
-</source>
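The counting, index, join/split and four-letter one-liners can be tried without loading a corpus. A minimal sketch, assuming a plain Python list of tokens as a stand-in for an NLTK text (the sentence is made up):

```python
# A plain list of tokens stands in for an NLTK Text object (assumption).
sentence = "the cat sat on the mat and the dog lay flat"
tokens = sentence.split()            # split on whitespace -> list of tokens

n_tokens = len(tokens)               # word tokens
n_types = len(set(tokens))           # word types (distinct words)
avg_use = float(n_tokens) / n_types  # average uses per distinct word

first_mat = tokens.index("mat")      # position of the first "mat"
four_letter = sorted(w for w in set(tokens) if len(w) == 4)
rejoined = " ".join(tokens)          # join the list back into one string
```

The float() call matters under Python 2, where dividing two integers would truncate the ratio.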
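-*And list them from most to least frequent
-<source lang='python'>
-from nltk import FreqDist
-fdist = FreqDist(text8)
-# sort the vocabulary by descending count
-vocab = sorted(fdist, key=fdist.get, reverse=True)
-for w in vocab:
-    if len(w) == 4:
-        print(w)
-</source>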
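The same frequency ordering can be sketched with the standard library. Here collections.Counter is an assumed stand-in for FreqDist, and the token list is invented for illustration:

```python
from collections import Counter

tokens = "this that this with that this and from this".split()
counts = Counter(tokens)  # stand-in for nltk.FreqDist (assumption)

# most_common() yields (word, count) pairs, highest count first
four_by_freq = [w for w, _ in counts.most_common() if len(w) == 4]
```

Words with equal counts come out in no guaranteed frequency order, so only the ranking between different counts should be relied on.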
-*Find all words containing 'ma' in them, sorted.
-<source lang='python'>
-res = sorted([w for w in set(text) if 'ma' in w])
-</source>
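Both the startswith() and the substring filters are case-sensitive, which is easy to miss. A small sketch on a made-up token list (a plain list standing in for an NLTK text) showing the difference:

```python
tokens = "Mary made many small maps of the human market by boat".split()
vocab = set(tokens)  # unique words only

starts_with_b = sorted(w for w in vocab if w.startswith("b"))
contains_ma = sorted(w for w in vocab if "ma" in w)
# lower-casing each word before the test also catches "Mary"
contains_ma_any_case = sorted(w for w in vocab if "ma" in w.lower())
```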
-*How often a given word occurs in a text, expressed as a percentage
-<source lang='python'>
-fdist = FreqDist(text)
-# multiply by 100.0 to get a percentage and to avoid integer division
-100.0 * fdist['word'] / len(text)
-</source>
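Under Python 2, dividing one integer count by another truncates to 0, and the plain ratio is a fraction rather than a percentage. A minimal sketch of the percentage calculation, with collections.Counter as an assumed stand-in for FreqDist and an invented token list:

```python
from collections import Counter

tokens = "to be or not to be that is the question".split()
counts = Counter(tokens)  # stand-in for nltk.FreqDist (assumption)

# 100.0 forces float arithmetic and scales the fraction to a percentage
share = 100.0 * counts["be"] / len(tokens)
missing = counts["absent"]  # unseen words count as 0 instead of raising
```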
-*Find occurrences of a word in context: text.concordance("term")
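To get a feel for what collocations and concordance look for, here is a rough hand-rolled sketch. It is not how NLTK implements them: the pair counting uses raw frequency only, and kwic() is a hypothetical helper name.

```python
from collections import Counter

tokens = ("we drank red wine then more red wine "
          "near the red door").split()

# Count adjacent word pairs; zip stops at the shorter sequence.
pair_counts = Counter(zip(tokens, tokens[1:]))
top_pair, top_count = pair_counts.most_common(1)[0]

def kwic(tokens, term, width=2):
    """Return each hit of `term` with `width` tokens of context per side."""
    lines = []
    for i, w in enumerate(tokens):
        if w == term:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [w] + right))
    return lines

hits = kwic(tokens, "wine")
```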
-== Gutenberg stuff ==
-*To access the raw text: gutenberg.raw('blake-poems.txt'). This returns a single string, so len(gutenberg.raw('blake-poems.txt')) counts characters (including spaces) rather than words. To split the text into sentences: macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt'). The words() method breaks it into words instead: emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
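The three views differ only in granularity. A toy sketch of that difference, using naive regular expressions on a made-up string instead of the Gutenberg corpus (NLTK's own tokenizers are more careful than this):

```python
import re

raw = "The sun rose. Birds sang, and we walked."

n_chars = len(raw)                       # raw view: every character counts
words = re.findall(r"\w+|[^\w\s]", raw)  # words and punctuation as tokens
# split after sentence-ending punctuation followed by whitespace
sentences = [s for s in re.split(r"(?<=[.!?])\s+", raw) if s]
```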
-*Find certain words in a text, and how many times they appear
-<source lang='python'>
-from nltk.corpus import brown
-import nltk
-news_text = brown.words(categories="news")
-fdist = nltk.FreqDist([w.lower() for w in news_text])
-wh_words = ['what','where','who','why']
-for w in wh_words:
-    print("%s :  %d" % (w, fdist[w]))
-</source>
-Result:
-<source lang='python'>
-what :  95
-where :  59
-who :  268
-why :  14
-</source>
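Lower-casing before counting is what makes "What" and "what" land in the same bucket. A small sketch of the effect, with collections.Counter as an assumed stand-in for FreqDist and an invented token list instead of the Brown corpus:

```python
from collections import Counter

tokens = "Who said what ? What did who say , and why ? Why not ?".split()

exact = Counter(tokens)                     # case-sensitive counts
folded = Counter(w.lower() for w in tokens) # case-folded counts

wh_words = ["what", "where", "who", "why"]
# (word, case-sensitive count, case-folded count)
report = [(w, exact[w], folded[w]) for w in wh_words]
```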
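-*Plot how often words starting with 'america' and 'citizen' are used in the inaugural addresses, by year
-<source lang='python'>
-import nltk
-from nltk.corpus import inaugural
-# condition = target word, sample = year (first 4 characters of the file id)
-cfd = nltk.ConditionalFreqDist(
-    (target, fileid[:4])
-    for fileid in inaugural.fileids()
-    for w in inaugural.words(fileid)
-    for target in ['america', 'citizen']
-    if w.lower().startswith(target))
-cfd.plot()
-</source>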
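What the conditional frequency distribution tallies before plotting can be sketched with nested counters. The file ids and texts below are hypothetical, not real inaugural data, and defaultdict(Counter) is only a stand-in for ConditionalFreqDist:

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus: file id -> tokens
speeches = {
    "1789-Example.txt": "Citizens of America".split(),
    "1793-Example.txt": "fellow citizens and citizen soldiers".split(),
    "1797-Example.txt": "American ideals for America".split(),
}
targets = ["america", "citizen"]

cfd = defaultdict(Counter)  # condition (target) -> Counter of years
for fileid, words in speeches.items():
    year = fileid[:4]       # first 4 characters of the file id
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                cfd[target][year] += 1
```

Because the match is a prefix test, "American" and "Citizens" are counted under 'america' and 'citizen' as well.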

Latest revision as of 18:49, 18 November 2010