User:Joca/python-experiments
Latest revision as of 10:04, 28 March 2018
On this page I highlight two iterations of the script I made for this Special Issue.
Word Tagger V1
This script reads an input text, tokenizes the words and runs a part-of-speech tagger. The tags are translated into human-readable equivalents, which are saved in a list. The script joins the list items into a string, which is printed in the terminal.
import nltk

# Step 1: define the input file and read it
input_path = 'input/kittler.txt'
with open(input_path, 'r') as txtfile:
    string = txtfile.read()

# Step 2: tokenize the text and run the POS tagger once over all tokens
words = nltk.word_tokenize(string)
taggedwordlist = nltk.pos_tag(words)
for word, pos in taggedwordlist:
    print('{0} is a {1}'.format(word, pos))

taglist = [pos for word, pos in taggedwordlist]
#print(taglist)

# Step 3: translate the Penn Treebank tags into human-readable labels
readabletaglist = []
for tag in taglist:
    if tag in {"NNP","NNS","NN","NNPS"}:
        readabletag = 'noun'
    elif tag in {'VB','VBD','VBG','VBN','VBP','VBZ'}:
        readabletag = 'verb'
    elif tag in {'RB','RBR','RBS','WRB'}:
        readabletag = 'adverb'
    elif tag in {'PRP','PRP$'}:
        readabletag = 'pronoun'
    elif tag in {'JJ','JJR','JJS'}:
        readabletag = 'adjective'
    elif tag == 'IN':
        readabletag = 'preposition'
    elif tag == 'WDT':
        readabletag = 'determiner'
    elif tag in {'WP','WP$'}:
        readabletag = 'pronoun'
    elif tag == 'UH':
        readabletag = 'interjection'
    elif tag == 'POS':
        readabletag = 'possessive ending'
    elif tag == 'SYM':
        readabletag = 'symbol'
    elif tag == 'EX':
        readabletag = 'existential there'
    elif tag == 'DT':
        readabletag = 'determiner'
    elif tag == 'MD':
        readabletag = 'modal'
    elif tag == 'LS':
        readabletag = 'list item marker'
    elif tag == 'FW':
        readabletag = 'foreign word'
    elif tag == 'CC':
        readabletag = 'coordinating conjunction'
    elif tag == 'CD':
        readabletag = 'cardinal number'
    elif tag == 'TO':
        readabletag = 'to'
    elif tag == '.':
        readabletag = 'line ending'
    elif tag == ',':
        readabletag = 'comma'
    else:
        # Unknown tags are passed through unchanged
        readabletag = tag
    readabletaglist.append(readabletag)

# Step 4: print all readable tags as one string
print(' '.join(readabletaglist))
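The long if/elif chain above could also be expressed as a dictionary lookup. The following is only a sketch of that alternative, not the script as it was written: the names `READABLE_TAGS`, `readable_tag` and `readable_tags` are made up for this example, and the tagged input is written by hand so the sketch runs without NLTK.

```python
# Hypothetical lookup table; the labels follow the if/elif chain in the script.
TAG_GROUPS = [
    (("NN", "NNS", "NNP", "NNPS"), "noun"),
    (("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"), "verb"),
    (("RB", "RBR", "RBS", "WRB"), "adverb"),
    (("PRP", "PRP$", "WP", "WP$"), "pronoun"),
    (("JJ", "JJR", "JJS"), "adjective"),
    (("DT", "WDT"), "determiner"),
    (("IN",), "preposition"),
    (("UH",), "interjection"),
    (("POS",), "possessive ending"),
    (("SYM",), "symbol"),
    (("EX",), "existential there"),
    (("MD",), "modal"),
    (("LS",), "list item marker"),
    (("FW",), "foreign word"),
    (("CC",), "coordinating conjunction"),
    (("CD",), "cardinal number"),
    (("TO",), "to"),
    ((".",), "line ending"),
    ((",",), "comma"),
]

# Flatten the groups into one tag -> label dictionary.
READABLE_TAGS = {tag: label for tags, label in TAG_GROUPS for tag in tags}


def readable_tag(tag):
    """Return the readable label for a tag, or the tag itself if it is unknown."""
    return READABLE_TAGS.get(tag, tag)


def readable_tags(tagged_words):
    """Translate a list of (word, tag) pairs into a list of readable labels."""
    return [readable_tag(pos) for _word, pos in tagged_words]


# Hand-written stand-in for the output of a POS tagger.
example = [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ"), (".", ".")]
print(" ".join(readable_tags(example)))
```

With a lookup table, adding or renaming a label means editing one entry, and unknown tags still fall through unchanged, as in the `else` branch of the original.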
Wordtagger V2
Based on V1, Wordtagger V2 tags the text for part of speech, stopwords and sentiment. Each tagger is a separate function that takes words as input and returns tags as output. These outputs are saved in a Python dictionary, from which an HTML page is generated using Jinja2. Using JavaScript and data attributes, the content of each word is swapped when the user clicks. I presented this version at the beta launch at Varia. Based on the feedback, I changed the way the words and tags are visualized in the reading interface to improve readability.
Example of the output: https://madebyjoca.com/xpub/wordtagger/index.html
# LIBS
# The NLTK data for the tokenizer, POS tagger, VADER lexicon and stopword list
# has to be installed beforehand with nltk.download()
import nltk
import json
import os
from sys import stdin, stdout
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from jinja2 import Template

# == INPUT AND TOKENIZE ==
# Read the input from stdin, tokenize it and save the tokens to a dictionary. Use the index as ID for each word.
text = stdin.read()
words = nltk.word_tokenize(text)
words_and_tags = {'item ' + str(index): {'word': word} for index, word in enumerate(words)}
print(words_and_tags)

# == FILTER FUNCTIONS ==

# === 1. POS_tagger & Named Entity Recognizer ===
# This function takes a list of tokens and runs the POS tagger over it once.
# Returns a list with one human-readable tag per token.
def POS_tagger(tokens):
    taggedwordlist = nltk.pos_tag(tokens)
    #for word, pos in taggedwordlist:
    #    print('{0} is a {1}'.format(word, pos)) # Uncomment to print the analysis step
    print(taggedwordlist)

    taglist = [pos for word, pos in taggedwordlist]
    POS_tags = []

    for tag in taglist:
        if tag in {"NNP","NNS","NN","NNPS"}:
            POS_tag = 'noun'
        elif tag in {'VB','VBD','VBG','VBN','VBP','VBZ'}:
            POS_tag = 'verb'
        elif tag in {'RB','RBR','RBS','WRB'}:
            POS_tag = 'adverb'
        elif tag in {'PRP','PRP$'}:
            POS_tag = 'pronoun'
        elif tag in {'JJ','JJR','JJS'}:
            POS_tag = 'adjective'
        elif tag == 'IN':
            POS_tag = 'preposition'
        elif tag == 'WDT':
            POS_tag = 'determiner'
        elif tag in {'WP','WP$'}:
            POS_tag = 'pronoun'
        elif tag == 'UH':
            POS_tag = 'interjection'
        elif tag == 'POS':
            POS_tag = 'possessive ending'
        elif tag == 'SYM':
            POS_tag = 'symbol'
        elif tag == 'EX':
            POS_tag = 'existential there'
        elif tag == 'DT':
            POS_tag = 'determiner'
        elif tag == 'MD':
            POS_tag = 'modal'
        elif tag == 'LS':
            POS_tag = 'list item marker'
        elif tag == 'FW':
            POS_tag = 'foreign word'
        elif tag == 'CC':
            POS_tag = 'coordinating conjunction'
        elif tag == 'CD':
            POS_tag = 'cardinal number'
        elif tag == 'TO':
            POS_tag = 'to'
        elif tag == '.':
            POS_tag = 'line ending'
        elif tag == ',':
            POS_tag = 'comma'
        else:
            # Unknown tags are passed through unchanged
            POS_tag = tag
        POS_tags.append(POS_tag)
        #print(POS_tag)
    return POS_tags

# === 2. Sentiment tagger ===
# Sentiment analyzer based on the NLTK VADER tagger.
# This function takes a single word as input. It tags the word based on its sentiment: negative, neutral or positive
# The analyzer is created once here, instead of once per word inside the function
analyzer = SentimentIntensityAnalyzer()

def sentiment_tagger(word):
    score = analyzer.polarity_scores(word)["compound"]
    if score < 0:
        sentiment_tag = 'negative'
    elif score > 0:
        sentiment_tag = 'positive'
    else:
        sentiment_tag = 'neutral'
    return sentiment_tag
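
# === 3. Stopword tagger ===
# Labels a word as a keyword or a stopword, based on the list in the NLTK corpus
# The stopword set is built once here, instead of once per word inside the function
stopWords = set(stopwords.words('english'))

def stopword_tagger(word):
    # The NLTK list is lowercase, so compare in lowercase to also catch e.g. 'The'
    if word.lower() in stopWords:
        stopword_tag = 'stopword'
    else:
        stopword_tag = 'keyword'
    return stopword_tag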

# Run POS tagger
# This tagger outputs a list for all items in the dict at once
# To avoid double work, it is better to keep this outside the for loop
POS_tags = POS_tagger(words)

# Adding tags to words in dictionary, which will be exported as a json file
# {'item 0' : {'word' : word, 'tagger 1': value 1}}
# The dictionary keeps insertion order (Python 3.7+), so item i matches POS tag i
for i, item in enumerate(words_and_tags):
    word = words_and_tags[item]['word']

    # POS
    pos_tag = POS_tags[i]
    words_and_tags[item]['POS'] = pos_tag

    # Add sentiment tag
    sentiment_tag = sentiment_tagger(word)
    words_and_tags[item]['sentiment'] = sentiment_tag

    # Add stopword tag
    stopword_tag = stopword_tagger(word)
    words_and_tags[item]['wordtype'] = stopword_tag

    # Add entity tag
    # Not functional yet

# Save data into a json file
print(words_and_tags)
#with open("data.json", 'w') as f:
# Go up three levels from this file to reach the project root, then join the output path
project_root = os.path.dirname(os.path.dirname(os.path.dirname(__file__)))
with open(os.path.join(project_root, "output/wordtagger/data.json"), 'w', encoding='utf-8') as f:
    json.dump(words_and_tags, f, ensure_ascii=False)

# Let's bind it to a jinja2 template
# The relative paths below are resolved against the current working directory,
# so when the script is run from the project root no dirname() chain is needed as for the json file above
with open("src/wordtagger/template.html", "r") as template_open:
    template = Template(template_open.read())
index_render = template.render(words_and_tags=words_and_tags)
#print(text_render)

# And render an html file!
print(index_render)
with open("output/wordtagger/index.html", "w") as index_open:
    index_open.write(index_render)
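The core idea of V2, separate tagger functions whose outputs are collected in one dictionary keyed by word index, can be sketched without NLTK. This is an illustration under assumptions, not the script itself: `build_words_and_tags`, `toy_sentiment` and `toy_stopword` are hypothetical names, and the two toy taggers only stand in for the VADER and stopword taggers.

```python
import json


def build_words_and_tags(words, pos_tags, word_taggers):
    """Combine tokens, a document-level tag list and per-word tagger functions.

    words:        list of tokens
    pos_tags:     list of labels, one per token (tagged in one pass)
    word_taggers: dict mapping a field name to a function(word) -> tag
    """
    if len(words) != len(pos_tags):
        raise ValueError("words and pos_tags must have the same length")
    result = {}
    for index, (word, pos) in enumerate(zip(words, pos_tags)):
        entry = {"word": word, "POS": pos}
        for field, tagger in word_taggers.items():
            entry[field] = tagger(word)
        # The index makes every key unique, even when a word occurs twice.
        result["item " + str(index)] = entry
    return result


# Hypothetical stand-ins for the sentiment and stopword taggers.
def toy_sentiment(word):
    return {"good": "positive", "bad": "negative"}.get(word.lower(), "neutral")


def toy_stopword(word):
    return "stopword" if word.lower() in {"the", "is", "a"} else "keyword"


data = build_words_and_tags(
    ["The", "film", "is", "good"],
    ["determiner", "noun", "verb", "adjective"],
    {"sentiment": toy_sentiment, "wordtype": toy_stopword},
)
print(json.dumps(data, ensure_ascii=False, indent=2))
```

Pairing words and tags with `zip` removes the manual counter, and passing the per-word taggers in as a dictionary means a new tagger only adds one entry.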
Excerpt from the Jinja template:
<div class="container"><p>
{# One span per word; the tags are stored as classes and as data attributes. #}
{# Attribute values are double-quoted and escaped, so words such as 's do not break the markup. #}
{% for item, value in words_and_tags.items() %}
  <span id="{{ item }}" class="word {{ value['sentiment'] }} {{ value['wordtype'] }} {{ value['POS'] }}"
    data-txt="{{ value['word']|e }}"
    data-pos="{{ value['POS']|e }}"
    {% if value['word'] in [',', '.', '(', ')'] %}data-sentiment="{{ value['word']|e }}"{% else %}data-sentiment="{{ value['sentiment'] }}"{% endif %}
    {% if value['wordtype'] == 'stopword' %}data-stopword="stopword"{% else %}data-stopword="{{ value['word']|e }}"{% endif %}
  >
    {{ value['POS'] }}
  </span>
{% endfor %}
</p></div>
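The template writes every word into HTML attributes, so tokens that contain quotes or angle brackets need escaping before they land in the markup. The sketch below rebuilds one span with the standard library only, to show that escaping step in isolation. It does not use Jinja, and `render_span` and `PUNCTUATION` are hypothetical names for this example.

```python
from html import escape

# Tokens that keep their own text in the sentiment view, as in the template.
PUNCTUATION = {",", ".", "(", ")"}


def render_span(item_id, entry):
    """Render one word entry as a span with escaped data attributes."""
    word = escape(entry["word"], quote=True)
    pos = escape(entry["POS"], quote=True)
    if entry["word"] in PUNCTUATION:
        sentiment = word
    else:
        sentiment = escape(entry["sentiment"], quote=True)
    stopword = "stopword" if entry["wordtype"] == "stopword" else word
    classes = escape(
        " ".join(["word", entry["sentiment"], entry["wordtype"], entry["POS"]]),
        quote=True,
    )
    return (
        '<span id="{id}" class="{classes}" data-txt="{word}" data-pos="{pos}" '
        'data-sentiment="{sentiment}" data-stopword="{stopword}">{pos}</span>'
    ).format(
        id=escape(item_id, quote=True),
        classes=classes,
        word=word,
        pos=pos,
        sentiment=sentiment,
        stopword=stopword,
    )


entry = {"word": "O'Brien", "POS": "noun", "sentiment": "neutral", "wordtype": "keyword"}
print(render_span("item 0", entry))
```

The apostrophe in the example word is written as a character reference inside the attributes, so the attribute value cannot be cut short by the word itself.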
Excerpt from the JavaScript used to swap the content of the span with the data attributes:
// Replace the visible text of every word span with one of its data attributes.
// This is an excerpt: `state` is defined outside of it.
$('.word').each(function() {
  var el = $(this);
  if (state == 0) {
    el.empty();
    el.html(el.data("stopword") + " ");
  }
  else if (state == 1) {
    el.empty();
    el.html(el.data("sentiment") + " ");
  }
  else {
    el.empty();
    el.html(el.data("pos") + " ");
  }
});
state = state+1;
Wordtagger V3
Wordtagger V3 is the script I presented as part of this Special Issue under the name Reading the Structure. You can read more about it on the dedicated wiki page (User:Joca/Software_Joca). The major difference from V2 is a new way of constructing the HTML page, which uses spans instead of data attributes. This made it possible to print the page using WeasyPrint. I also added named entity recognition and fixed an error in which I was creating duplicate dictionary values, resulting in data loss.