User:Joca/Software Joca: Difference between revisions

Revision as of 14:16, 27 March 2018

Reading the Structure

What does it do?

Reading the Structure attempts to make visible to human readers how machines, or to be more precise, implementations of the Bag-of-words model for text analysis, interpret texts. In this model, sentences are cut into loose words. Each word can then be labelled for importance, sentiment, or its function in the sentence.

During this process of structuring the text for the software, the relation with the original text fades away.

Reading the Structure is a reading interface that brings the labels back into the original text. Based on a certain label, like noun, neutral, or location, words are hidden. Does that make us, mere humans, able to read like our machines do and indeed read the structure?

The results are visible on screen and can be exported to posters in PDF format. In addition, users can download the structured data as a JSON file to use in their own projects.

Background of the concept

Structure of the software

Structuring and tagging the text in Python 3

Input

Using the recipe in the Makefile, the text in ocr/output.txt is sent to the script. The script reads standard input and tokenizes the string using the function provided by NLTK. The words are now items in a list.

Then, each word gets an ID derived from its index number in the list. This ID and the word are added to a dictionary with the structure {ID:{word:'word'}}. By using an ID, I am able to store multiple instances of the same word in the dictionary. This is relevant for commonly used words like "the" and "is".

By using another dictionary as the value, I can store multiple pieces of information under each ID. In an earlier version I used the word as the key for the dictionary, but I ran into trouble because Python dictionaries only support unique keys.

In the

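# == INPUT AND TOKENIZE ==
# Define input, tokenize and save tokens to dictionary. Use index as ID for each word.
from sys import stdin
import nltk  # word_tokenize needs the NLTK tokenizer data to be installed

text = stdin.read()  # named 'text' so the built-in input() is not shadowed
words = nltk.word_tokenize(text)
words_and_tags = {index: {'word': word} for index, word in enumerate(words)}
print(words_and_tags)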
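
The choice of the index as key can be illustrated with a small comparison. This is a minimal sketch, not the project's code: it splits on whitespace instead of calling NLTK so that it runs without extra downloads, and the names by_word and by_index are made up for the example.

```python
# Toy sentence in which the word "the" occurs twice.
tokens = "the cat saw the dog".split()

# Word as key: the second "the" overwrites the first one.
by_word = {word: {'word': word} for word in tokens}

# Index as key: every occurrence keeps its own entry.
by_index = {index: {'word': word} for index, word in enumerate(tokens)}

print(len(tokens), len(by_word), len(by_index))
```

With the word as key one occurrence is lost, while the index-based dictionary has exactly one entry per token.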

Tagging functions

The words go through multiple taggers: functions that label words for Part-of-Speech, sentiment, stopwords and named entities. An example of the stopword tagger can be found below. I chose a structure of separate functions for each tagger because I wanted to practice using functions. It also means I can easily reuse these functions in other projects.

In every case the input is an individual word or the list of tokenized words. The output is an individual tag or a list of tags.

# === 3. Stopword tagger ===
# Labels words on being a keyword or a stopword, based on the list in the NLTK corpus
from nltk.corpus import stopwords  # requires the NLTK 'stopwords' corpus to be installed

def stopword_tagger(word):

    stopWords = set(stopwords.words('english'))

    if word in stopWords:
        stopword_tag = 'stopword'
    else:
        stopword_tag = 'keyword'

    return stopword_tag

Adding tags to the dictionary

After defining the functions, I run a for loop over the entries of the dictionary I made in the first step. The new labels made by the taggers are added as values in the dictionary.

I found out that some taggers give a different output when they receive a list of tokens as input instead of a single word. This is because some taggers, like the NLTK POS tagger and the Sentiment.vader tagger, use the label of the previous token as an extra source of information.
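
Why the input shape matters can be shown with a toy example. The tagger below is hypothetical and much simpler than the NLTK taggers: it only looks at the previous token to detect a negation, and the word lists are made up for the illustration.

```python
POSITIVE = {"good", "happy"}   # made-up word lists for this sketch
NEGATIONS = {"not", "never"}

def toy_sentiment_tagger(tokens):
    """Tag each token, using the previous token as context."""
    tags = []
    for position, token in enumerate(tokens):
        previous = tokens[position - 1] if position > 0 else None
        if token in POSITIVE:
            # A negation right before a positive word flips the tag.
            tags.append("negative" if previous in NEGATIONS else "positive")
        else:
            tags.append("neutral")
    return tags

tokens = ["not", "good"]
together = toy_sentiment_tagger(tokens)                        # whole list at once
one_by_one = [toy_sentiment_tagger([t])[0] for t in tokens]    # each word alone
print(together, one_by_one)
```

Tagged together, "good" is labelled negative because of the preceding "not"; tagged on its own, the same word is labelled positive.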

For these taggers I run the function on all words at once. In the for loop I connect the items in the resulting lists to the right keys in the dictionary using their index, which is the same as the ID in the dictionary keys.

POS_tags = POS_tagger(words)
sentiment_tags = sentiment_tagger(words)
ner_tags = ner_tagger(words)

# Adding tags to words in dictionary, which will be exported as a json file
# {'item 0' : {'word' : word, 'tagger 1': value 1}}
# Each key (item) is the index of the word in the token list, so it can
# also be used to look up the matching tag in the lists made above.
for item, value in words_and_tags.items():
    word = value['word']

    # POS
    value['POS'] = POS_tags[item]

    # Add sentiment tag
    value['sentiment'] = sentiment_tags[item]

    # Named Entity Recognition
    value['named entity'] = ner_tags[item]

    # Add stopword tag (this tagger works on one word at a time)
    value['wordtype'] = stopword_tagger(word)

In the end, one entry in the dictionary looks like this:

"1": {"word": "salary", "POS": "noun", "sentiment": "neutral", "named entity": "no entity", "wordtype": "keyword"}
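
The quoted ID in this entry is consistent with how Python's json module serializes dictionaries: integer keys are written as strings. The sketch below assumes the export uses json.dumps, which the text above does not show, and reuses the entry as sample data.

```python
import json

# One entry with an integer ID, as produced by enumerate() in the first step.
words_and_tags = {
    1: {"word": "salary", "POS": "noun", "sentiment": "neutral",
        "named entity": "no entity", "wordtype": "keyword"},
}

as_json = json.dumps(words_and_tags)   # integer key 1 is written as "1"
restored = json.loads(as_json)         # keys come back as strings

print(as_json)
print(list(restored.keys()))
```

Code that reads the JSON file back therefore has to look up words with string IDs, or convert the keys back to integers.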

Generating an HTML interface using Jinja2

Jinja2 is a Python library that can generate an HTML page using data from a Python script. It uses templates with a tag syntax derived from Django. To make the reader, I create a container with spans. Each span contains one word and its tags. See the example below:

{% for item, value in words_and_tags.items() %}

      <span id="{{item}}" class="wrapper {{ words_and_tags[item]['wordtype'] }} {{ words_and_tags[item]['sentiment'] }} {{ words_and_tags[item]['POS'] }} {% if words_and_tags[item]['named entity'] == 'no entity' %} no_entity {% else %} known_entity {% endif %}">
          <span class ="tag ner invisible"> {{ words_and_tags[item]['named entity'] }}</span>
          <span class ="tag wordtype invisible"> {{ words_and_tags[item]['wordtype'] }} </span>
          <span class ="tag sentiment invisible"> {{ words_and_tags[item]['sentiment'] }}</span>
          <span class ="tag pos invisible"> {{ words_and_tags[item]['POS'] }}</span>
          <span class ="word {% if words_and_tags[item]['word'] in [',','.','(',')',';',':'] %} punctuation {% else %} {{ words_and_tags[item]['word'] }} {% endif %}"> {{ words_and_tags[item]['word'] }}</span>
      </span>

{% endfor %}

After rendering the template, the span of one word looks like this:

<span id="3" class="wrapper keyword neutral noun  no_entity ">
          <span class="tag ner invisible"> no entity</span>
          <span class="tag wordtype invisible"> keyword </span>
          <span class="tag sentiment invisible"> neutral</span>
          <span class="tag pos invisible"> noun</span>
          <span class="word  women "> women</span>
      </span>
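
The class logic of the template can also be read as plain Python. The helper below is hypothetical and only mirrors the conditions in the template; it is not part of the project, which uses Jinja2 for this step.

```python
def wrapper_classes(entry):
    """Return the class names the template puts on the wrapper span of a word."""
    if entry["named entity"] == "no entity":
        entity_class = "no_entity"
    else:
        entity_class = "known_entity"
    return ["wrapper", entry["wordtype"], entry["sentiment"], entry["POS"], entity_class]

# Sample entry matching the rendered span shown above.
entry = {"word": "women", "POS": "noun", "sentiment": "neutral",
         "named entity": "no entity", "wordtype": "keyword"}
classes = " ".join(wrapper_classes(entry))
print(classes)
```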

The index.html features all words and their labels. Each collection of words and labels is inside a .wrapper with an id.

The classes of each wrapper hold the values of the labels (e.g. class="wrapper keyword neutral noun no_entity"); the word itself is added as a class on the span inside the wrapper. By default all words inside the wrapper have the class .word. This class is visible. All labels (noun, neutral etc.) have the class .tag and another class with the type of label (POS, sentiment, etc). .tag is by default display: none;

If the user clicks on one of the .wrapper elements, the state changes and the page hides a different set of words.

What happens inside the wrapper if the state changes to hide all nouns?

- Previous filter is disabled. All tags invisible. All words visible.
- the words in the wrapper with class noun are selected. They get the class word_label. Which means: only visible on :hovering the wrapper.
- the span with the text 'noun' and class 'pos' will lose the class invisible. The tag is now visible in the text.
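
The effect of such a filter on the text can be modelled outside the browser. The sketch below is a hypothetical Python model of the "hide all nouns" state, not the jQuery code of the interface; the function name apply_filter, the sample entries and the bracket notation for a visible label are assumptions made for illustration.

```python
def apply_filter(words_and_tags, tag_type, tag_value):
    """Return the text as shown on screen: matching words are replaced by their label."""
    shown = []
    for index in sorted(words_and_tags):
        entry = words_and_tags[index]
        if entry[tag_type] == tag_value:
            shown.append("[" + tag_value + "]")   # word hidden, label visible
        else:
            shown.append(entry["word"])           # word stays visible
    return " ".join(shown)

# Made-up sample data in the structure {ID: {'word': ..., 'POS': ...}}.
words_and_tags = {
    0: {"word": "women", "POS": "noun"},
    1: {"word": "earn", "POS": "verb"},
    2: {"word": "less", "POS": "adjective"},
    3: {"word": "salary", "POS": "noun"},
}
result = apply_filter(words_and_tags, "POS", "noun")
print(result)
```

Switching to another state corresponds to calling the function with a different label type and value, for example "sentiment" and "neutral".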

// State 6 Selectors for the sentiment tagger, showing only positive and negative words and their labels
  var neutral_word = $('.neutral > .word');
  var neutral_label = $('.neutral > .sentiment');

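  // On page load, prepare the right view for state one. Hiding all stopwords, showing the stopword label
  // (stopword_word and stopword_label are selectors like the ones above; their definitions are not part of this excerpt)
  var state = 1;
  stopword_word.addClass('word_label');
  stopword_label.removeClass('invisible');

  $("#poster").prop("href", "poster_stopword.pdf");

  // Here we run through the states
  $('.container').click( function() {
    console.log(state);

    if (state == 1) {
      // Leave the stopword view: show the stopwords again and hide their labels
      stopword_word.removeClass('word_label');
      stopword_label.addClass('invisible');

      $("#poster").prop("href", "poster_neutral.pdf");

      // Enter the sentiment view: hide the neutral words and show their labels
      neutral_word.addClass('word_label');
      neutral_label.removeClass('invisible');
    }
    // The remaining states are not part of this excerpt
  });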

Exporting json and pdf

Choosing a license

https://stackoverflow.com/questions/8580223/using-python-module-on-lgpl-license-in-commercial-product
