Reading the Structure
What does it do?
Reading the Structure attempts to make visible to human readers how machines, or more precisely, an implementation of the bag-of-words model for text analysis, interpret texts. In this model, sentences are cut into loose words. Each word can then be labelled for importance, sentiment, or its function in the sentence.
During this process of structuring the text for the software, the relation to the original text fades away.
Reading the Structure is a reading interface that brings the labels back into the original text. Based on a certain label, like noun, neutral, or location, words are hidden. Does that make us, mere humans, able to read like our machines do and indeed read the structure?
The results are visible on screen and can be exported as posters in PDF format. Besides that, users can download the structured data as a JSON file to use in their own projects.
Background of the concept
Structure of the software
Structuring and tagging the text in Python 3
Input
Using the recipe in the makefile, the text in ocr/output.txt is sent to the script. The script reads standard input and tokenizes the string using the word_tokenize function provided by NLTK. The words then become items in a list.
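The make recipe itself is not shown on this page. A minimal sketch of what such a recipe could look like, where the script name (tagger.py) and target name (tags.json) are assumptions and not part of the original project:

# Hypothetical recipe: pipe the OCR output into the tagging script
tags.json: ocr/output.txt
	python3 tagger.py < ocr/output.txt > tags.json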
Then, each word gets an ID derived from its index number in the list. This ID and the word are added to a dictionary with the structure {ID: {'word': word}}. By using an ID, I am able to store multiple instances of the same word in the dictionary. This is relevant for commonly used words like "the" and "is".
By using another dictionary as the value, I can store multiple pieces of information under each ID. In an earlier version I used the word itself as the key, but I ran into trouble because Python dictionaries only support unique keys.
In the code below, the input is read from standard input and tokenized, and the tokens are saved to the dictionary:
# == INPUT AND TOKENIZE ==
# Read the input, tokenize it, and save the tokens to a dictionary.
# Use the index as the ID for each word.
from sys import stdin

import nltk

input = stdin.read()
words = nltk.word_tokenize(input)
words_and_tags = {index: {'word': word} for index, word in enumerate(words)}
print(words_and_tags)
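For example, for the input "the salary is", the dictionary starts out as:

{0: {'word': 'the'}, 1: {'word': 'salary'}, 2: {'word': 'is'}}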
Tagging functions
The words go through multiple taggers: functions that label words for part of speech, sentiment, stopwords, and named entities. An example of the stopword tagger can be found below. I chose a structure of separate functions for each tagger because I wanted to practice with the use of functions. Besides that, I can now easily reuse these functions in other projects.
In every case the input is an individual word or the list of tokenized words, and the output is an individual tag or a list of tags.
# === 3. Stopword tagger ===
# Labels a word as a keyword or a stopword, based on the stopword list
# in the NLTK corpus
from nltk.corpus import stopwords

def stopword_tagger(word):
    stopWords = set(stopwords.words('english'))
    if word in stopWords:
        stopword_tag = 'stopword'
    else:
        stopword_tag = 'keyword'
    return stopword_tag
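The POS_tagger, sentiment_tagger, and ner_tagger functions called in the next step are not shown on this page. A minimal sketch of how they could look, assuming NLTK's pos_tag and ne_chunk and the VADER SentimentIntensityAnalyzer; note that the actual script also maps raw tags to readable labels like "noun" and "location", which this sketch leaves out:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Sketch: part-of-speech tag per token, using NLTK's pos_tag.
# Returns raw Penn Treebank tags such as 'NN'.
def POS_tagger(words):
    return [tag for word, tag in nltk.pos_tag(words)]

# Sketch: sentiment label per token, based on VADER's compound score
def sentiment_tagger(words):
    analyzer = SentimentIntensityAnalyzer()
    tags = []
    for word in words:
        score = analyzer.polarity_scores(word)['compound']
        if score > 0:
            tags.append('positive')
        elif score < 0:
            tags.append('negative')
        else:
            tags.append('neutral')
    return tags

# Sketch: named entity label per token, using NLTK's ne_chunk on
# POS-tagged tokens. Entities come out as raw labels like 'PERSON'.
def ner_tagger(words):
    tags = []
    for chunk in nltk.ne_chunk(nltk.pos_tag(words)):
        if hasattr(chunk, 'label'):
            # A subtree: every token in it belongs to the entity
            tags.extend([chunk.label()] * len(chunk))
        else:
            tags.append('no entity')
    return tags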
Adding tags to the dictionary
After defining the functions, I run a for loop over the words in the dictionary I made in the first step. New labels made by the taggers are added as values in the dictionary.
I found out that some taggers give a different output when I use a list of tokens as input instead of one word. This is because some taggers, like the NLTK POS tagger and the VADER sentiment tagger (nltk.sentiment.vader), use the surrounding tokens as an extra source of information.
For these taggers I run the function on all words at once. In the for loop I connect the items in the resulting list to the right keys in the dictionary using their index, which is the same as the ID in the dictionary keys.
POS_tags = POS_tagger(words)
sentiment_tags = sentiment_tagger(words)
ner_tags = ner_tagger(words)

i = 0

# Adding tags to words in the dictionary, which will be exported as a json file
# {'item 0' : {'word' : word, 'tagger 1': value 1}}
for item, value in words_and_tags.items():
    word = words_and_tags[item]['word']

    # Add POS tag
    pos_tag = POS_tags[i]
    words_and_tags[item]['POS'] = pos_tag

    # Add sentiment tag
    sentiment_tag = sentiment_tags[i]
    words_and_tags[item]['sentiment'] = sentiment_tag

    # Add named entity tag
    ner_tag = ner_tags[i]
    words_and_tags[item]['named entity'] = ner_tag

    # Add stopword tag
    stopword_tag = stopword_tagger(word)
    words_and_tags[item]['wordtype'] = stopword_tag

    # Move to the next word in the tokenized words list
    i = i + 1
In the end, one entry in the dictionary looks like this:
"1": {"word": "salary", "POS": "noun", "sentiment": "neutral", "named entity": "no entity", "wordtype": "keyword"}