== NLP with NLTK/Python ==

*Count the number of words (word tokens): len(text)
*Count the number of distinct words (word types): len(set(text))
*Lexical diversity, here the average number of times each distinct word is used: len(text) / len(set(text)) (in Python 2, wrap one side in float() to avoid integer division)
*Dispersion plot: shows where given words occur along the text (and so over time, if the text is in chronological order); useful for quick overviews, e.g. text.dispersion_plot(['of','the'])
*Collocations: pairs of words that occur together unusually often (e.g. "red wine"): text.collocations()
*Join/split convert between lists and delimited strings: ' '.join(words) builds a string from a list, sentence.split(' ') builds a list from a string
*All words starting with 'b' in text5, unique and sorted: sorted([w for w in set(text5) if w.startswith('b')])
*Find the position (index) of the first occurrence of a word: text.index('word')
*Find all 4-letter words in a text:
<source lang='python'>
# V is the vocabulary: the set of distinct words in text8
V = set(text8)
fourletter = [w for w in V if len(w) == 4]
sorted(fourletter)
</source>

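*Show the 4-letter words of a text (here text5) from most to least frequent:
<source lang='python'>
from nltk import FreqDist
from nltk.book import text5

fdist = FreqDist(text5)
# most_common() yields (word, count) pairs in decreasing order of frequency
for w, count in fdist.most_common():
    if len(w) == 4:
        print(w, count)
</source>
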
*Find all words containing 'ma', sorted:
<source lang='python'>
res = sorted([w for w in set(text) if 'ma' in w])
</source>

*How often a given word occurs in a text, expressed as a percentage:
<source lang='python'>
from nltk import FreqDist
fdist = FreqDist(text)
# 100.0 turns the fraction into a percentage and forces float division
100.0 * fdist['word'] / len(text)
</source>

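A minimal sketch of the percentage calculation, using collections.Counter as a stand-in for FreqDist (in NLTK 3, FreqDist is a Counter subclass) so that it runs without any corpus. The token list and the helper name percentage_of are made up for illustration.

```python
from collections import Counter

# Hypothetical token list; with NLTK this would be a text such as text5.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

def percentage_of(word, tokens):
    """Return how often `word` occurs in `tokens`, as a percentage."""
    counts = Counter(tokens)
    # Multiplying by 100.0 first keeps this a float division everywhere.
    return 100.0 * counts[word] / len(tokens)

print(percentage_of("the", tokens))  # 3 of the 8 tokens
print(percentage_of("dog", tokens))  # absent words simply count as 0
```

A Counter returns 0 for a word it has never seen, so no existence check is needed before dividing.
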
*Find occurrences of a word in context: text.concordance("term")

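Several of the one-liners in this section can be tried without downloading a corpus. The sketch below uses a made-up token list in place of an nltk.Text object and plain Python in place of the NLTK methods. The bigram count is only a rough proxy for collocations, since NLTK's collocations() goes further than raw counts.

```python
from collections import Counter

# Hypothetical token list standing in for an nltk.Text object.
tokens = "red wine and red wine with white bread and some meat".split(" ")

n_tokens = len(tokens)         # word tokens
n_types = len(set(tokens))     # word types (distinct words)
avg_uses = n_tokens / n_types  # average uses per distinct word (Python 3 division)

# Positions at which a word occurs: the data a dispersion plot draws.
offsets = [i for i, w in enumerate(tokens) if w == "red"]

# Adjacent word pairs and how often each one occurs.
bigrams = Counter(zip(tokens, tokens[1:]))

print(n_tokens, n_types, avg_uses)
print(offsets)
print(bigrams.most_common(1))
print(tokens.index("wine"))    # position of the first occurrence
print(" ".join(tokens[:2]))    # join a list back into a string
```
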
== Gutenberg stuff ==

*To access the raw text: gutenberg.raw('blake-poems.txt') returns the whole file as one string, so len(gutenberg.raw('blake-poems.txt')) counts characters (spaces included) rather than words.
*To split a text into sentences: macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
*To split a text into words, use the words() method: emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

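The difference between the three views (characters, words, sentences) can be illustrated without the corpus. This is only an analogy: the sample string is made up, and the regular expressions are crude stand-ins for NLTK's own tokenizers, not what raw(), words() and sents() actually use.

```python
import re

# Hypothetical raw text; gutenberg.raw(...) returns one long string like this.
raw = "Tyger Tyger, burning bright. In the forests of the night."

n_chars = len(raw)  # characters, spaces and punctuation included

# Crude stand-in for words(): runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", raw)

# Crude stand-in for sents(): split after ., ! or ? and tokenize each piece.
sents = [re.findall(r"\w+|[^\w\s]", s) for s in re.split(r"(?<=[.!?])\s+", raw)]

print(n_chars, len(words), len(sents))
print(sents[0])
```
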
*Count how many times certain words appear in a text:
<source lang='python'>
import nltk
from nltk.corpus import brown

news_text = brown.words(categories="news")
# Lowercase every token so that 'What' and 'what' are counted together
fdist = nltk.FreqDist(w.lower() for w in news_text)
wh_words = ['what', 'where', 'who', 'why']
for m in wh_words:
    print(m + " :", fdist[m])
</source>
Result:
<source lang='python'>
what : 95
where : 59
who : 268
why : 14
</source>

*Plot how often words starting with 'america' and 'citizen' are used in each inaugural address, by year:
<source lang='python'>
import nltk
from nltk.corpus import inaugural

# Condition = target word, sample = year (first 4 characters of the file id)
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()
</source>
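To see what the conditional frequency distribution holds before it is plotted, here is a rough plain-Python equivalent. The file ids and word lists are invented for illustration, and a defaultdict of Counters stands in for nltk.ConditionalFreqDist.

```python
from collections import Counter, defaultdict

# Hypothetical (fileid -> words) data standing in for the inaugural corpus.
speeches = {
    "1789-Example.txt": ["Citizens", "of", "America", "and", "fellow", "citizens"],
    "1793-Example.txt": ["American", "citizens"],
}
targets = ["america", "citizen"]

# condition (target word) -> Counter mapping sample (year) to count
cfd = defaultdict(Counter)
for fileid, words in speeches.items():
    year = fileid[:4]
    for w in words:
        for target in targets:
            # startswith() also matches 'American', 'citizens', and so on
            if w.lower().startswith(target):
                cfd[target][year] += 1

for target in targets:
    print(target, sorted(cfd[target].items()))
```

Each condition becomes one line in the plot, with the years on the x-axis and the counts on the y-axis.
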