User:Joca/python-experiments: Difference between revisions

From XPUB & Lens-Based wiki
Latest revision as of 10:04, 28 March 2018

On this page I highlight two iterations of the script I made for this Special Issue.

Word Tagger V1

This script reads an input text, tokenizes the words and runs a Part-of-Speech tagger. The tags are changed into human-readable equivalents, which are saved in a list. The script joins the list items into a string, which is printed in the terminal.

import nltk

# Step 1: define the input file
input_path = 'input/kittler.txt'

# Step 2: read the text, split it into tokens and tag each token
with open(input_path, 'r') as txtfile:
    string = txtfile.read()

words = nltk.word_tokenize(string)
taggedwordlist = nltk.pos_tag(words)

# Print the raw tag for each word
for word, pos in taggedwordlist:
    print('{0} is a {1}'.format(word, pos))

taglist = [ pos for word,pos in taggedwordlist ]


readabletaglist = []

for tag in taglist:
    if tag in {"NNP","NNS","NN","NNPS"}:
        readabletag = 'noun'
    elif tag in {'VB','VBD','VBG','VBN','VBP','VBZ'}:
        readabletag = 'verb'
    elif tag in {'RB','RBR','RBS','WRB'}:
        readabletag = 'adverb'
    elif tag in {'PRP','PRP$'}:
        readabletag = 'pronoun'
    elif tag in {'JJ','JJR','JJS'}:
        readabletag = 'adjective'
    elif tag == 'IN':
        readabletag = 'preposition'
    elif tag == 'WDT':
        readabletag = 'determiner'
    elif tag in {'WP','WP$'}:
        readabletag = 'pronoun'
    elif tag == 'UH':
        readabletag = 'interjection'
    elif tag == 'POS':
        readabletag = 'possessive ending'
    elif tag == 'SYM':
        readabletag = 'symbol'
    elif tag == 'EX':
        readabletag = 'existential there'
    elif tag == 'DT':
        readabletag = 'determiner'
    elif tag == 'MD':
        readabletag = 'modal'
    elif tag == 'LS':
        readabletag = 'list item marker'
    elif tag == 'FW':
        readabletag = 'foreign word'
    elif tag == 'CC':
        readabletag = 'coordinating conjunction '
    elif tag == 'CD':
        readabletag = 'cardinal number'
    elif tag == 'TO':
        readabletag = 'to'
    elif tag == '.':
        readabletag = 'line ending'
    elif tag == ',':
        readabletag = 'comma'
    else:
        # Keep the raw tag when there is no readable equivalent
        readabletag = tag

    readabletaglist.append(readabletag)

print(' '.join(readabletaglist))
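
The long if/elif chain above could also be written as a lookup table. The following is a minimal sketch of that alternative, not the script as it was used: the names READABLE_TAGS, readable_tag and the sample sentence are hypothetical, and the tagger output is hard-coded so the example runs without NLTK.

```python
# Lookup table from Penn Treebank tags to readable names.
# It mirrors the pairs in the if/elif chain of the script above.
READABLE_TAGS = {}
for tags, name in [
    (("NN", "NNS", "NNP", "NNPS"), "noun"),
    (("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"), "verb"),
    (("RB", "RBR", "RBS", "WRB"), "adverb"),
    (("PRP", "PRP$", "WP", "WP$"), "pronoun"),
    (("JJ", "JJR", "JJS"), "adjective"),
    (("DT", "WDT"), "determiner"),
    (("IN",), "preposition"),
    (("UH",), "interjection"),
    (("POS",), "possessive ending"),
    (("SYM",), "symbol"),
    (("EX",), "existential there"),
    (("MD",), "modal"),
    (("LS",), "list item marker"),
    (("FW",), "foreign word"),
    (("CC",), "coordinating conjunction"),
    (("CD",), "cardinal number"),
    (("TO",), "to"),
    ((".",), "line ending"),
    ((",",), "comma"),
]:
    for tag in tags:
        READABLE_TAGS[tag] = name


def readable_tag(tag):
    """Return the readable name for a tag, or the tag itself if it is unknown."""
    return READABLE_TAGS.get(tag, tag)


# Hypothetical tagger output, hard-coded so the example runs without NLTK.
sample = [("Media", "NNS"), ("determine", "VBP"), ("our", "PRP$"),
          ("situation", "NN"), (".", ".")]
print(' '.join(readable_tag(pos) for word, pos in sample))
```

With a table like this, adding or renaming a tag means changing one line of data, and the fallback for unknown tags lives in a single place.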

Wordtagger V2

Based on V1, Wordtagger V2 tags the text for Part-of-Speech, stopwords and sentiment. Each tagger is a separate function that takes words as input and returns tags as output. These outputs are saved in a Python dictionary, from which an HTML page is generated using jinja2. Using JavaScript and data attributes, the content of each word is swapped when the user clicks. I presented this version at the beta launch at Varia. Based on the feedback I changed the way the words and tags are visualized in the reading interface to improve readability.
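
The structure described here, one function per tagger and one dictionary entry per word, can be sketched roughly as follows. This is an illustration under stated assumptions, not the script itself: the word sets STOPWORDS, POSITIVE and NEGATIVE and the helper tag_words are hypothetical stand-ins for the NLTK stopword corpus and the VADER analyzer, so the example runs without NLTK data. Unlike the original stopword function, the stand-ins lower-case each word before the lookup.

```python
import json

# Hypothetical stand-ins for the NLTK resources used in the real script.
STOPWORDS = {"the", "a", "is", "of"}
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "awful"}


def stopword_tagger(word):
    """Label a word as 'stopword' or 'keyword'."""
    return 'stopword' if word.lower() in STOPWORDS else 'keyword'


def sentiment_tagger(word):
    """Label a word as 'negative', 'positive' or 'neutral'."""
    lowered = word.lower()
    if lowered in NEGATIVE:
        return 'negative'
    if lowered in POSITIVE:
        return 'positive'
    return 'neutral'


def tag_words(words):
    """Build {'item N': {'word': ..., 'sentiment': ..., 'wordtype': ...}}."""
    words_and_tags = {}
    for index, word in enumerate(words):
        words_and_tags['item ' + str(index)] = {
            'word': word,
            'sentiment': sentiment_tagger(word),
            'wordtype': stopword_tagger(word),
        }
    return words_and_tags


result = tag_words("The writing is good".split())
print(json.dumps(result, indent=2))
```

Because the index is part of the key, every token gets its own entry even when the same word occurs more than once in the text.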

Example of the output >>

import nltk
import json
import os
from sys import stdin, stdout
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from jinja2 import Template

# Define input, tokenize and save tokens to dictionary. Use index as ID for each word.
# Assumption: the text arrives on standard input (stdin is imported above)
input = stdin.read()
words = nltk.word_tokenize(input)
words_and_tags = {'item ' + str(index) : {'word':word} for index , word in enumerate(words)}


# === 1. POS_tagger & Named Entity Recognizer ===
# This function runs a POS tagger on a list of words. Returns a list with readable tags
def POS_tagger(words):
    taggedwordlist = nltk.pos_tag(words)

    # Uncomment to print the analysis step
    # for word, pos in taggedwordlist:
    #     print('{0} is a {1}'.format(word, pos))
    taglist = [pos for word, pos in taggedwordlist]
    POS_tags = []

    for tag in taglist:
        if tag in {"NNP","NNS","NN","NNPS"}:
            POS_tag = 'noun'
        elif tag in {'VB','VBD','VBG','VBN','VBP','VBZ'}:
            POS_tag = 'verb'
        elif tag in {'RB','RBR','RBS','WRB'}:
            POS_tag = 'adverb'
        elif tag in {'PRP','PRP$'}:
            POS_tag = 'pronoun'
        elif tag in {'JJ','JJR','JJS'}:
            POS_tag = 'adjective'
        elif tag == 'IN':
            POS_tag = 'preposition'
        elif tag == 'WDT':
            POS_tag = 'determiner'
        elif tag in {'WP','WP$'}:
            POS_tag = 'pronoun'
        elif tag == 'UH':
            POS_tag = 'interjection'
        elif tag == 'POS':
            POS_tag = 'possesive ending'
        elif tag == 'SYM':
            POS_tag = 'symbol'
        elif tag == 'EX':
            POS_tag = 'existential there'
        elif tag == 'DT':
            POS_tag = 'determiner'
        elif tag == 'MD':
            POS_tag = 'modal'
        elif tag == 'LS':
            POS_tag = 'list item marker'
        elif tag == 'FW':
            POS_tag = 'foreign word'
        elif tag == 'CC':
            POS_tag = 'coordinating conjunction '
        elif tag == 'CD':
            POS_tag = 'cardinal number'
        elif tag == 'TO':
            POS_tag = 'to'
        elif tag == '.':
            POS_tag = 'line ending'
        elif tag == ',':
            POS_tag = 'comma'
        else:
            # Keep the raw tag when there is no readable equivalent
            POS_tag = tag
        POS_tags.append(POS_tag)
    return POS_tags

# === 2. Sentiment tagger ===
# Sentiment analyzer based on the NLTK VADER tagger.
# This function uses words as an input. It tags each word based on its sentiment: negative, neutral or positive
def sentiment_tagger(word):
    analyzer = SentimentIntensityAnalyzer()
    score = analyzer.polarity_scores(word).get("compound")

    if score < 0:
        sentiment_tag = 'negative'
    elif score > 0:
        sentiment_tag = 'positive'
    else:
        sentiment_tag = 'neutral'

    return sentiment_tag

# === 3. Stopword tagger ===
# Labels words on being a keyword or a stopword, based on the list in the NLTK corpus
def stopword_tagger(word):

    stopWords = set(stopwords.words('english'))

    if word in stopWords:
        stopword_tag = 'stopword'
    else:
        stopword_tag = 'keyword'

    return stopword_tag

# Run POS tagger
# This tagger outputs a list for all items in the dict at once
# To avoid double work, it is better to keep this outside the for loop
POS_tags = POS_tagger(words)
i = 0

# Adding tags to words in dictionary, which will be exported as a json file
# {'item 0' : {'word' : word, 'tagger 1': value 1}}
for item, value in words_and_tags.items():
    word = words_and_tags[item]['word']

    # POS
    pos_tag = POS_tags[i]
    words_and_tags[item]['POS'] = pos_tag
    i = i+1

    # Add sentiment tag
    sentiment_tag = sentiment_tagger(word)
    words_and_tags[item]['sentiment'] = sentiment_tag

    # Add stopword tag
    stopword_tag = stopword_tagger(word)
    words_and_tags[item]['wordtype'] = stopword_tag

    # Add entity tag
    # Not functional yet

# Save data into a json file
#with open("data.json", 'w') as f:
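# Move up three levels from this file to the project root, then into output/wordtagger
project_root = os.path.dirname(os.path.dirname(os.path.dirname(__file__)))
with open(os.path.join(project_root, "output/wordtagger/data.json"), 'w') as f: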
  json.dump(words_and_tags, f, ensure_ascii=False)

# Let's bind it to a jinja2 template
# Jinja moves up one level by default, so I do not need to do it myself as for the json file above
with open("src/wordtagger/template.html", "r") as template_open:
    template = Template(template_open.read())
index_render = template.render(words_and_tags=words_and_tags)

# And render an html file!
with open("output/wordtagger/index.html", "w") as index_open:
    index_open.write(index_render)

Excerpt from the Jinja template:

  <div class="container"><p>
     {% for item, value in words_and_tags.items() %}
      <span id="{{item}}" class="word {{words_and_tags[item]['sentiment']}} {{words_and_tags[item]['wordtype']}} {{words_and_tags[item]['POS']}}" 
      data-txt="{{ words_and_tags[item]['word'] }}" 
      data-pos="{{words_and_tags[item]['POS']}}" {% if words_and_tags[item]['word'] in [',','.','(',')'] %} 
      data-sentiment= "{{ words_and_tags[item]['word'] }}" {% else %} data-sentiment= '{{ words_and_tags[item]['sentiment'] }}' {% endif %} 
      {% if words_and_tags[item]['wordtype'] == 'stopword' %} data-stopword= "stopword" {% else %} data-stopword= '{{ words_and_tags[item]['word'] }}' {% endif %}>{{ words_and_tags[item]['word'] }}</span>
    {% endfor %}
  </p></div>
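
To make the logic of the template easier to follow, here is a rough Python equivalent that builds one span with plain string formatting instead of jinja2. It is only a sketch: the function word_span and the constant PUNCTUATION are hypothetical names, and using the word itself as the initial content of the span is an assumption.

```python
from html import escape

# Punctuation marks that show themselves instead of a sentiment label.
PUNCTUATION = {',', '.', '(', ')'}


def word_span(item_id, entry):
    """Build one <span> with the data attributes used by the reading interface."""
    word = entry['word']
    # Punctuation keeps its own character in the sentiment view.
    sentiment = word if word in PUNCTUATION else entry['sentiment']
    # Stopwords are replaced by the label 'stopword'; keywords stay visible.
    stopword = 'stopword' if entry['wordtype'] == 'stopword' else word
    classes = ' '.join(['word', entry['sentiment'], entry['wordtype'], entry['POS']])
    return (
        '<span id="{id}" class="{cls}" data-txt="{txt}" data-pos="{pos}" '
        'data-sentiment="{sent}" data-stopword="{stop}">{txt}</span>'
    ).format(
        id=escape(item_id),
        cls=escape(classes),
        txt=escape(word),      # escape() also escapes quotes by default
        pos=escape(entry['POS']),
        sent=escape(sentiment),
        stop=escape(stopword),
    )


entry = {'word': 'good', 'POS': 'adjective', 'sentiment': 'positive', 'wordtype': 'keyword'}
print(word_span('item 3', entry))
```

Escaping every value keeps a word that contains a quote or an angle bracket from breaking the attribute it is written into.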

Excerpt from the JavaScript used to swap the content of the span with the data attributes:

    $('.word').each(function() {
      var el = $(this);

      if (state == 0) {
        el.html(el.data("stopword") + "&nbsp;");
      }
      else if (state == 1) {
        el.html(el.data("sentiment") + "&nbsp;");
      }
      else {
        el.html(el.data("pos") + "&nbsp;");
      }
    });

    state = state+1;

Wordtagger V3

Wordtagger V3 is the script I presented as part of this Special Issue under the name Reading the Structure. You can read more about it on the dedicated wiki page. The major difference from V2 is a new way of constructing the HTML page, which uses spans instead of data attributes. This made it possible to print the page using WeasyPrint. I also added named entity recognition and fixed an error in which I was creating duplicate dictionary values, resulting in data loss.