User:Joca/Software Joca

Reading the Structure

What does it do?

reading interface

Poster and json outputs

Reading the Structure is an attempt to show human readers how machines interpret texts or, to be more precise, how an implementation of the bag-of-words model for text analysis does. In this model the sentences are cut into loose words. Each word can then be labelled for importance, sentiment, or its function in the sentence.
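
As a minimal sketch of the bag-of-words idea, the snippet below cuts a text into loose words and counts them. It uses a simple regular expression instead of the NLTK tokenizer used in this project, and the example sentence is made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Cut a text into loose, lowercased words and count each one."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


bag = bag_of_words("The cat sat on the mat. The mat was red.")
print(bag.most_common(2))
```

The counter no longer knows in which order the words appeared, which is exactly the loss of context the project tries to make visible.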

During this process of structuring the text for the software, the relation with the original text fades away.

Reading the Structure is a reading interface that brings the labels back into the original text. Based on a certain label, like noun, neutral, or location, words are hidden. Does that make us, mere humans, able to read like our machines do and indeed read the structure?

The results are visible on screen and can be exported as posters in PDF format. Besides that, users can download the structured data as a JSON file to use in their own projects.

Static copy of reading interface generated for a text

Background

In A bag but is language nothing of words, Michael Murtaugh writes about the bag of words, a method in which algorithms process a text by cutting the sentences into loose words. Removed from their context, each word is represented and classified in a data structure.

In the article Murtaugh refers to a TED Talk by Tim Berners-Lee in which he demands that online publishers provide 'unadulterated' data. With raw information, it would be easier for software to structure this unstructured data. But as Murtaugh concludes at the end of the essay, language itself has a structure. Assuming that text is unstructured until software has processed it means ignoring how text is used outside the field of computer science.

The aspect of 'breaking' the text in order to analyze it algorithmically caught my interest. After processing the text, the relation between the original text and the words in the data structure fades away. On the other hand, the text and the processed data represent the same information. Only the formatting and the hierarchy of the information are different. Because of this I was interested in bringing the processed words back into the order of the original text.

The idea of mapping the data about a situation back onto the real situation at a 1:1 scale is something that appears several times in literature. In Sylvie & Bruno, Lewis Carroll writes about the 1:1 map of the world:

“And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!” “Have you used it much?” I enquired. “It has never been spread out, yet,” said Mein Herr. “The farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well.”

In Klont (2017), Maxime Februari writes about the relation between reality and its copy made of structured data:

"Do you know what datafication is? Datafication: the translation of the reality into data. You get a person and translate her into personal data. You get a patient and translate him into medical data. You convert money into financial data, warfare into location data, reading a book into data on reading behaviour and data on the content. In the end you reduce the reality into data about the reality. When you are done, you got a clone of the reality. But it's not a clone, it's an image of the world that is composed of data about the world. But in the end, data is something inherently different than the world itself."

The reading interface created by my script plays with the original text and its data copy. Sometimes the results are funny, when hiding words changes the message of the text. At other times it helps to uncover a particular aspect of the text's structure, for example the use of sentiment-related words in relation to the text as a whole. I hope that by showing the processed text in a way that resembles the original text, people get to see how an algorithm reads the text, including the errors and inconsistencies that are part of the output of natural language processing.

Structure of the software

Structuring and tagging the text in Python 3

Input

Using the recipe in the Makefile, the text in ocr/output.txt is sent to the script. The script reads standard input and tokenizes the string using the function provided by NLTK. The words are now items in a list.

Then each word gets an ID derived from its index in the list. This ID and the word are added to a dictionary with the structure {ID: {'word': word}}. By using an ID, I am able to store multiple instances of the same word in the dictionary. This is relevant for commonly used words like "the" and "is".

By using another dictionary as the value, I can store multiple pieces of information under each ID. In an earlier version I used the word as the key for the dictionary, but I ran into trouble because Python dictionaries only support unique keys.


# == INPUT AND TOKENIZE ==
# Read the input, tokenize it and save the tokens to a dictionary. Use the index as ID for each word.
from sys import stdin
import nltk

text = stdin.read()  # named `text` to avoid shadowing the built-in input()
words = nltk.word_tokenize(text)
words_and_tags = {index: {'word': word} for index, word in enumerate(words)}
print(words_and_tags)
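
To illustrate why the index is used as the key, this small sketch compares both approaches on a made-up list of tokens. The names by_word and by_index are only used here for illustration.

```python
tokens = ["the", "cat", "saw", "the", "dog"]

# Word as key: the second "the" overwrites the first one.
by_word = {word: {"word": word} for word in tokens}

# Index as key: every occurrence keeps its own entry and its position.
by_index = {index: {"word": word} for index, word in enumerate(tokens)}

print(len(by_word))   # one entry fewer than there are tokens
print(len(by_index))  # one entry per token
```

Because the index doubles as the position of the word, the original reading order can always be restored from the dictionary.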

Tagging functions

The words go through multiple taggers: functions that label words for Part-of-Speech, sentiment, stopwords and named entities. An example of the stopword tagger can be found below. I chose a structure of separate functions for each tagger because I wanted to practise using functions. Besides that, I can now easily reuse these functions in other projects.

In every case the input is an individual word or the list of tokenized words. The output is an individual tag or a list of tags.

# === 3. Stopword tagger ===
# Labels words on being a keyword or a stopword, based on the list in the NLTK corpus
from nltk.corpus import stopwords

def stopword_tagger(word):
    # Requires the NLTK 'stopwords' corpus, see nltk.download('stopwords')
    stopWords = set(stopwords.words('english'))

    if word in stopWords:
        stopword_tag = 'stopword'
    else:
        stopword_tag = 'keyword'

    return stopword_tag

Adding tags to the dictionary

After defining the functions, I run a for loop over the entries in the dictionary I made in the first step. The new labels made by the taggers are added as values in the dictionary.

I found out that some taggers gave a different output when I used a list of tokens as an input, instead of one word. This is because some taggers, like the NLTK POS tagger and the Sentiment.vader tagger, use the label of the previous token as an extra source of information.

In these cases I run the function on all words at once, which returns a list of tags. In the for loop I connect the items in this list to the right keys in the dictionary using their index, which is the same as the ID in the dictionary keys.

POS_tags = POS_tagger(words)
sentiment_tags = sentiment_tagger(words)
ner_tags = ner_tagger(words)

# Adding tags to words in dictionary, which will be exported as a json file
# {'item 0' : {'word' : word, 'tagger 1': value 1}}
for i, (item, value) in enumerate(words_and_tags.items()):
    word = value['word']

    # Part-of-Speech: tagged for the whole list at once, looked up by index
    value['POS'] = POS_tags[i]

    # Sentiment: tagged for the whole list at once, looked up by index
    value['sentiment'] = sentiment_tags[i]

    # Named Entity Recognition: tagged for the whole list at once, looked up by index
    value['named entity'] = ner_tags[i]

    # Stopword tag: this tagger works on one word at a time
    value['wordtype'] = stopword_tagger(word)

In the end, one entry in the dictionary looks like this:

"1": {"word": "salary", "POS": "noun", "sentiment": "neutral", "named entity": "no entity", "wordtype": "keyword"}

Generating an HTML interface using Jinja2

Jinja2 is a Python library that can generate an HTML page using data from a Python script. It uses templates with a tag syntax derived from Django. The basic set-up of the reading interface is a container with a number of spans. Each span contains one word and its tags. See the example below:

{% for item, value in words_and_tags.items() %}

      <span id="{{item}}" class="wrapper {{ words_and_tags[item]['wordtype'] }} {{ words_and_tags[item]['sentiment'] }} {{ words_and_tags[item]['POS'] }} {% if words_and_tags[item]['named entity'] == 'no entity' %} no_entity {% else %} known_entity {% endif %}">
          <span class ="tag ner invisible"> {{ words_and_tags[item]['named entity'] }}</span>
          <span class ="tag wordtype invisible"> {{ words_and_tags[item]['wordtype'] }} </span>
          <span class ="tag sentiment invisible"> {{ words_and_tags[item]['sentiment'] }}</span>
          <span class ="tag pos invisible"> {{ words_and_tags[item]['POS'] }}</span>
          <span class ="word {% if words_and_tags[item]['word'] in [',','.','(',')',';',':'] %} punctuation {% else %} {{ words_and_tags[item]['word'] }} {% endif %}"> {{ words_and_tags[item]['word'] }}</span>
      </span>

{% endfor %}

After rendering the template, the span of one word looks like this:

<span id="3" class="wrapper keyword neutral noun  no_entity ">
          <span class="tag ner invisible"> no entity</span>
          <span class="tag wordtype invisible"> keyword </span>
          <span class="tag sentiment invisible"> neutral</span>
          <span class="tag pos invisible"> noun</span>
          <span class="word  women "> women</span>
      </span>
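
To make the logic of the template easier to follow, the sketch below rebuilds the span of one word in plain Python. It is a stand-in for illustration and does not use Jinja2. The function render_word and the constant PUNCTUATION are hypothetical names, and the whitespace in the output is more compact than in the rendered template.

```python
from html import escape

PUNCTUATION = [',', '.', '(', ')', ';', ':']


def render_word(item, entry):
    """Build the nested spans for one word, mirroring the template's logic."""
    entity_class = 'no_entity' if entry['named entity'] == 'no entity' else 'known_entity'
    word = entry['word']
    # Punctuation gets a fixed class, every other word is used as its own class.
    word_class = 'punctuation' if word in PUNCTUATION else escape(word)
    wrapper_classes = ' '.join(
        ['wrapper', entry['wordtype'], entry['sentiment'], entry['POS'], entity_class]
    )
    tags = [('ner', entry['named entity']), ('wordtype', entry['wordtype']),
            ('sentiment', entry['sentiment']), ('pos', entry['POS'])]
    tag_spans = ''.join(
        f'<span class="tag {kind} invisible">{escape(value)}</span>'
        for kind, value in tags
    )
    return (f'<span id="{item}" class="{wrapper_classes}">{tag_spans}'
            f'<span class="word {word_class}">{escape(word)}</span></span>')


entry = {'word': 'women', 'POS': 'noun', 'sentiment': 'neutral',
         'named entity': 'no entity', 'wordtype': 'keyword'}
print(render_word(3, entry))
```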

The index.html features all words and their labels. Each collection of words and labels is inside a .wrapper with an id.

The classes of these wrappers feature the values of the labels (e.g. class="wrapper keyword neutral noun no_entity"). The word itself sits in a span inside the wrapper that has the class .word by default. This class is visible. All labels (noun, neutral etc.) have the class .tag and another class with the type of label (POS, sentiment, etc.). .tag is display: none; by default.

If the user clicks on one of the .wrapper elements, the state changes and the page hides a different set of words.

What happens inside the wrapper if the state changes to hide all nouns?

- The previous filter is disabled: all tags become invisible and all words visible.

- The words in the wrappers with the class noun are selected. They get the class word_label, which means they are only visible when hovering over the wrapper.

- The span with the text 'noun' and the class 'pos' loses the class invisible. The tag is now visible in the text.

  // Note: stopword_word and stopword_label are assumed to be defined earlier in the script, like the selectors below

  // State 6 Selectors for the sentiment tagger, showing only positive and negative words and their labels
  var neutral_word = $('.neutral > .word');
  var neutral_label = $('.neutral > .sentiment');

  // On page load, prepare the right view for state one. Hiding all stopwords, showing the stopword label
  var state = 1;
  stopword_word.addClass('word_label');
  stopword_label.removeClass('invisible');

  $("#poster").prop("href", "poster_stopword.pdf");

  // Here we run through the states
  $('.container').click( function() {
    console.log(state);

    if (state == 1) {
      // Leave the stopword view
      stopword_word.removeClass('word_label');
      stopword_label.addClass('invisible');

      $("#poster").prop("href", "poster_neutral.pdf");

      // Hide the neutral words and show their sentiment label instead
      neutral_word.addClass('word_label');
      neutral_label.removeClass('invisible');
    }
    // (excerpt: the handling of the other states is not shown here)
  });
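
The effect of one filter can also be modelled outside the browser. The sketch below is a simplified Python illustration, not part of the interface itself: the function apply_filter and the sample entries are made up, and the label is shown in square brackets where the interface would hide the word and show its tag.

```python
def apply_filter(words_and_tags, tagger, label):
    """Replace every word whose `tagger` value equals `label` by the label itself."""
    output = []
    for item in sorted(words_and_tags):
        entry = words_and_tags[item]
        output.append(f'[{label}]' if entry[tagger] == label else entry['word'])
    return ' '.join(output)


sample = {
    0: {'word': 'women', 'POS': 'noun', 'wordtype': 'keyword'},
    1: {'word': 'earn', 'POS': 'verb', 'wordtype': 'keyword'},
    2: {'word': 'the', 'POS': 'determiner', 'wordtype': 'stopword'},
    3: {'word': 'salary', 'POS': 'noun', 'wordtype': 'keyword'},
}

print(apply_filter(sample, 'POS', 'noun'))
print(apply_filter(sample, 'wordtype', 'stopword'))
```

Switching from one state to the next then amounts to calling the function with a different tagger and label on the same data.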

Exporting json and pdf

Using the json library of Python, I export the dictionary with all the words and tags to a .json file. To generate the posters I use WeasyPrint. Because I can't use JavaScript to make specific words visible or invisible in WeasyPrint, I use a custom stylesheet for each variation of the poster. In this stylesheet I specify the size of the poster, its background color, and which words are replaced by their labels.
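
A minimal sketch of the JSON export, using made-up entries. It serializes to a string with json.dumps so that it runs without touching the file system; writing to a file would use json.dump with an open file instead. Note that the integer IDs become strings in the export, because JSON object keys are always strings.

```python
import json

words_and_tags = {
    0: {'word': 'the', 'POS': 'determiner', 'sentiment': 'neutral',
        'named entity': 'no entity', 'wordtype': 'stopword'},
    1: {'word': 'salary', 'POS': 'noun', 'sentiment': 'neutral',
        'named entity': 'no entity', 'wordtype': 'keyword'},
}

# The integer IDs are converted to strings during serialization.
exported = json.dumps(words_and_tags, ensure_ascii=False, indent=2)
print(exported)

# When reading the data back, convert the keys if integer IDs are needed again.
restored = {int(key): value for key, value in json.loads(exported).items()}
```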

The print-*.css files are mostly the same, apart from lines 21 to 34 in the example below.

@page {
  /* dimensions for the whole page */
  size: 297mm 420mm;
  margin: 7rem 2rem 7rem 2rem;
  background-color: #d9d9d6; /* To emulate #dfdfdf on print at PZI Canon MFP */
  position: absolute;
  display: table;

  @bottom-center {
    content: 'make reading_structure';
    font-family: 'Ubuntu Mono', monospace;
    white-space: pre;
    color: #000;
    padding-bottom: 3em;
  }

}

/* ---

 ELEMENTS TO CHANGE HIDDEN WORDS AND ACCENT COLOR

 --- */

span.word_label, .noun .word {
  display: block;
  opacity: 0;
  width: 100%;
  font-size: 1rem;
}

.noun > .pos {
  color: #003cb3;
}
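
Since only this last part differs between the print stylesheets, it could in principle be generated per label. The sketch below is one possible way to do that with Python's string.Template. It is an assumption for illustration, not how the files in this project were made: the function variable_rules, the posters mapping and the accent color for the neutral poster are hypothetical.

```python
from string import Template

# Template for the rules that change per poster, modelled on the example above.
VARIABLE_RULES = Template("""\
span.word_label, .$label .word {
  display: block;
  opacity: 0;
  width: 100%;
  font-size: 1rem;
}

.$label > .$tag_class {
  color: $accent;
}
""")


def variable_rules(label, tag_class, accent):
    """Fill in the label to hide, the class of its tag and the accent color."""
    return VARIABLE_RULES.substitute(label=label, tag_class=tag_class, accent=accent)


# Hypothetical mapping: hidden label -> (class of its tag span, accent color)
posters = {
    'noun': ('pos', '#003cb3'),
    'neutral': ('sentiment', '#003cb3'),  # placeholder color, not from the project
}

for label, (tag_class, accent) in posters.items():
    print(f'/* variable part for the {label} poster */')
    print(variable_rules(label, tag_class, accent))
```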