User:Joca/Software Joca

Revision as of 13:01, 27 March 2018

Reading the Structure

What does it do?

Reading the Structure attempts to make visible to human readers how machines, or more precisely the implementation of the Bag-of-words model for text analysis, interpret texts. In this model, sentences are cut into separate words. Each word can then be labelled for importance, sentiment, or its function in the sentence.

During this process of structuring the text for the software, the relation with the original text fades away.
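
As a minimal sketch of why the relation with the original text fades, the snippet below counts words in two sentences with opposite meanings. It assumes a plain whitespace split instead of the NLTK tokenizer used in the project, and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter


def bag_of_words(text):
    """Split the text on whitespace and count the words, ignoring their order."""
    return Counter(text.lower().split())


first = bag_of_words("the dog bites the man")
second = bag_of_words("the man bites the dog")

# Both sentences contain the same words the same number of times,
# so their bags are identical although the meaning differs.
print(first == second)  # True
```

Once the word order is gone, nothing in the bag tells which sentence it came from.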

Reading the Structure is a reading interface that brings the labels back into the original text. Based on a certain label - like noun, neutral, or location - words are hidden. Does that make us, mere humans, able to read like our machines do and indeed read the structure?
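
The idea of hiding words by label can be sketched in a few lines of Python. This is only an illustration: in the project the hiding happens in the browser, and `hide_by_label` and the sample data are hypothetical. The sketch assumes every word is stored under a numeric ID together with its labels, as in the dictionary described further on.

```python
def hide_by_label(words_and_tags, tagger, label):
    """Return the text with every word that carries `label` replaced by the label."""
    shown = []
    for index in sorted(words_and_tags):
        entry = words_and_tags[index]
        if entry.get(tagger) == label:
            shown.append(f"[{label}]")
        else:
            shown.append(entry["word"])
    return " ".join(shown)


# Hypothetical sample in the {ID: {'word': ..., tagger: label}} structure
sample = {
    0: {"word": "the", "wordtype": "stopword"},
    1: {"word": "salary", "wordtype": "keyword"},
    2: {"word": "is", "wordtype": "stopword"},
    3: {"word": "low", "wordtype": "keyword"},
}

print(hide_by_label(sample, "wordtype", "stopword"))
# [stopword] salary [stopword] low
```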

The results are visible on screen and can be exported as posters in PDF format. In addition, users can download the structured data as a JSON file to use in their own projects.

Background of the concept

Structure of the software

Structuring and tagging the text in Python 3

Input

Using the recipe in the Makefile, the text in ocr/output.txt is sent to the script. The script reads standard input and tokenizes the string using the function provided by NLTK. The words are now items in a list.

Then, each word gets an ID derived from its index number in the list. This ID and the word are added to a dictionary with the structure {ID:{word:'word'}}. By using an ID, I am able to store multiple instances of the same word in the dictionary. This is relevant for commonly used words like "the" and "is".

By using another dictionary as the value, I can store multiple pieces of information under each ID. In an earlier version I used the word as the key for the dictionary, but I ran into trouble because Python dictionaries only support unique keys, so every repeated word overwrote the earlier one.

In the

# == INPUT AND TOKENIZE ==
# Define input, tokenize and save tokens to dictionary. Use index as ID for each word.
from sys import stdin
import nltk

# Read the whole text from standard input (named text to avoid shadowing the built-in input())
text = stdin.read()
words = nltk.word_tokenize(text)
words_and_tags = {index: {'word': word} for index, word in enumerate(words)}
print(words_and_tags)
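
To illustrate why the index is used as the key, here is a small comparison of both approaches. The token list is a made-up example, and `by_word` and `by_index` are hypothetical names.

```python
# A made-up token list in which "the" occurs twice
words = ["the", "cat", "saw", "the", "dog"]

# Word as key: the second "the" overwrites the first one
by_word = {word: {"word": word} for word in words}

# Index as key: every occurrence keeps its own entry
by_index = {index: {"word": word} for index, word in enumerate(words)}

print(len(by_word))   # 4
print(len(by_index))  # 5
```

With the index as key, the dictionary also preserves the position of each word, which is needed to rebuild the text in reading order.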

Tagging functions

The words go through multiple taggers: functions that label words for Part-of-Speech, sentiment, stopwords and named entities. An example of the stopword tagger can be found below. I chose a structure of separate functions for each tagger because I wanted to practise using functions. Besides that, I can now easily reuse these functions in other projects.

In every case the input is an individual word or the list of tokenized words. The output is an individual tag or a list of tags.

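# === 3. Stopword tagger ===
# Labels words on being a keyword or a stopword, based on the list in the NLTK corpus
from nltk.corpus import stopwords

# Build the set once instead of on every call
stopWords = set(stopwords.words('english'))

def stopword_tagger(word):
    # The NLTK list is lowercase, so a capitalised word such as 'The' is tagged as a keyword
    if word in stopWords:
        stopword_tag = 'stopword'
    else:
        stopword_tag = 'keyword'

    return stopword_tag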
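
A tagger that takes one word can be turned into one that takes a list by mapping it over the tokens. The sketch below shows this under two assumptions: `TOY_STOPWORDS` is a tiny stand-in for the NLTK list so the example runs without downloading the corpus, and `toy_stopword_tagger` and `tag_all` are hypothetical names.

```python
# Tiny stand-in for the NLTK stopword list (assumption, for illustration only)
TOY_STOPWORDS = {"the", "is", "a", "of"}


def toy_stopword_tagger(word):
    """Tag a single word as 'stopword' or 'keyword'."""
    return "stopword" if word in TOY_STOPWORDS else "keyword"


def tag_all(words, tagger):
    """Apply a word-level tagger to every token and return the list of tags."""
    return [tagger(word) for word in words]


tags = tag_all(["the", "salary", "is", "low"], toy_stopword_tagger)
print(tags)  # ['stopword', 'keyword', 'stopword', 'keyword']
```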

Adding tags to the dictionary

After defining the functions, I run a for loop over the entries of the dictionary I made in the first step. The new labels made by the taggers are added as values in the dictionary.

I found out that some taggers gave a different output when I used a list of tokens as an input, instead of one word. This is because some taggers, like the NLTK POS tagger and the Sentiment.vader tagger, use the label of the previous token as an extra source of information.

In this case I run the function for all words at once. In the for loop I connect the items in this list to the right keys in the dictionary using their index, which is the same as the ID in the dictionary keys.

POS_tags = POS_tagger(words)
sentiment_tags = sentiment_tagger(words)
ner_tags = ner_tagger(words)

# Adding tags to words in dictionary, which will be exported as a json file
# {'item 0' : {'word' : word, 'tagger 1': value 1}}
# The key (item) is the index of the word in the token list, so it also
# points to the matching position in each list of tags.
for item, value in words_and_tags.items():
    word = value['word']

    # Part-of-Speech tag
    value['POS'] = POS_tags[item]

    # Sentiment tag
    value['sentiment'] = sentiment_tags[item]

    # Named Entity Recognition
    value['named entity'] = ner_tags[item]

    # Stopword tag: this tagger works on a single word, so it is called inside the loop
    value['wordtype'] = stopword_tagger(word)
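
The difference between tagging word by word and tagging the whole list can be shown with a toy tagger that looks at the previous token. This is a deliberately simplified illustration and not how the NLTK taggers work internally; `toy_sentiment_tagger` and its two word sets are hypothetical.

```python
NEGATIONS = {"not", "never", "no"}
POSITIVE = {"good", "great"}


def toy_sentiment_tagger(words):
    """Tag each token; a positive word directly after a negation becomes negative."""
    tags = []
    for position, word in enumerate(words):
        previous = words[position - 1] if position > 0 else None
        if word in POSITIVE:
            tags.append("negative" if previous in NEGATIONS else "positive")
        else:
            tags.append("neutral")
    return tags


words = ["not", "good"]

# Tagging every word on its own loses the context
per_word = [toy_sentiment_tagger([word])[0] for word in words]
# Tagging the whole list at once keeps it
all_at_once = toy_sentiment_tagger(words)

print(per_word)     # ['neutral', 'positive']
print(all_at_once)  # ['neutral', 'negative']

# Connect the tags to the dictionary through the shared index
words_and_tags = {index: {"word": word} for index, word in enumerate(words)}
for index, tag in enumerate(all_at_once):
    words_and_tags[index]["sentiment"] = tag
```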

In the end, one entry in the dictionary looks like this:

"1": {"word": "salary", "POS": "noun", "sentiment": "neutral", "named entity": "no entity", "wordtype": "keyword"}

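The quoted key "1" in the entry above is what Python's json module produces for an integer key. The export code itself is not shown here, so the following is only a sketch that assumes the dictionary is written out with `json.dumps`.

```python
import json

words_and_tags = {
    1: {
        "word": "salary",
        "POS": "noun",
        "sentiment": "neutral",
        "named entity": "no entity",
        "wordtype": "keyword",
    }
}

# JSON object keys are always strings, so the integer ID 1 becomes "1"
as_json = json.dumps(words_and_tags)
restored = json.loads(as_json)
print(list(restored.keys()))  # ['1']

# Convert the keys back to integers when the IDs are needed as list indexes
restored_with_int_keys = {int(key): value for key, value in restored.items()}
```

Anyone reusing the downloaded JSON in Python may want this conversion before matching the IDs to positions in a token list.
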
Generating an HTML interface using Jinja2

Jinja2 is a Python library that can generate an HTML page using data from a Python script. It uses templates with a tag syntax derived from Django. To make the reader, I create a container with spans. Each span contains one word and its tags. See the example below:

{% for item, value in words_and_tags.items() %}

      <span id="{{item}}" class="wrapper {{ words_and_tags[item]['wordtype'] }} {{ words_and_tags[item]['sentiment'] }} {{ words_and_tags[item]['POS'] }} {% if words_and_tags[item]['named entity'] == 'no entity' %} no_entity {% else %} known_entity {% endif %}">
          <span class ="tag ner invisible"> {{ words_and_tags[item]['named entity'] }}</span>
          <span class ="tag wordtype invisible"> {{ words_and_tags[item]['wordtype'] }} </span>
          <span class ="tag sentiment invisible"> {{ words_and_tags[item]['sentiment'] }}</span>
          <span class ="tag pos invisible"> {{ words_and_tags[item]['POS'] }}</span>
          <span class ="word {% if words_and_tags[item]['word'] in [',','.','(',')',';',':'] %} punctuation {% else %} {{ words_and_tags[item]['word'] }} {% endif %}"> {{ words_and_tags[item]['word'] }}</span>
      </span>

{% endfor %}

After rendering the template, the span of one word looks like this:

<span id="3" class="wrapper keyword neutral noun  no_entity ">
          <span class="tag ner invisible"> no entity</span>
          <span class="tag wordtype invisible"> keyword </span>
          <span class="tag sentiment invisible"> neutral</span>
          <span class="tag pos invisible"> noun</span>
          <span class="word  women "> women</span>
      </span>

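The Python side of the rendering step is not shown above. A minimal sketch, assuming the `jinja2` package is installed and using a shortened, hypothetical template held in a string instead of the project's template file, could look like this:

```python
from jinja2 import Template

# Shortened stand-in for the real template: one span per word,
# with two of its tags as CSS classes
TEMPLATE = Template("""{% for item, value in words_and_tags.items() -%}
<span id="{{ item }}" class="wrapper {{ value['wordtype'] }} {{ value['sentiment'] }}">{{ value['word'] }}</span>
{% endfor %}""")

words_and_tags = {
    3: {"word": "women", "wordtype": "keyword", "sentiment": "neutral"},
}

# render() fills the template with the data passed as keyword arguments
html = TEMPLATE.render(words_and_tags=words_and_tags)
print(html)
```

Because the loop unpacks each entry into `item` and `value`, the template can read `value['word']` directly instead of looking up `words_and_tags[item]` every time.
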
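// Selectors for the stopword tagger, used in state 1.
// Assumption: these two definitions are not part of this excerpt; they are written by analogy with the neutral selectors below.
  var stopword_word = $('.stopword > .word');
  var stopword_label = $('.stopword > .wordtype');

// State 6 Selectors for the sentiment tagger, showing only positive and negative words and their labels
  var neutral_word = $('.neutral > .word');
  var neutral_label = $('.neutral > .sentiment');

  // On page load, prepare the right view for state one. Hiding all stopwords, showing the stopword label
  var state = 1;
  stopword_word.addClass('word_label');
  stopword_label.removeClass('invisible');

  $("#poster").prop("href", "poster_stopword.pdf");

  // Here we run through the states
  $('.container').click( function() {
    console.log(state);

    if (state == 1) {
      // Leave the stopword view: show the stopwords again and hide their labels
      stopword_word.removeClass('word_label');
      stopword_label.addClass('invisible');

      $("#poster").prop("href", "poster_neutral.pdf");

      // Enter the neutral view: hide the neutral words and show their labels
      neutral_word.addClass('word_label');
      neutral_label.removeClass('invisible');
    }
    // The handling of the other states is not included in this excerpt
  });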

Exporting json and pdf

Choosing a license

https://stackoverflow.com/questions/8580223/using-python-module-on-lgpl-license-in-commercial-product