User:Joca/Software Joca

From XPUB & Lens-Based wiki

Reading the Structure

What does it do?

Reading structure screen interface.png

Reading the Structure attempts to make visible to human readers how machines, or to be more precise, the implementation of the Bag-of-words model for text analysis, interpretd texts. In the model the sentences are cut into loose words. Then each word can be labelled for importance, sentiment, or its function in the sentence.

During this process of structuring the text for the software, the relation with the original text fades away.

Reading the Structure is a reading interface that brings the labels back in the original text. Based on a certain label - like noun, neutral, or location - words are hidden. Does that makes us, mere humans, able to read like our machines do and indeed read the structure?

The results are visible on screen, and can be exported to posters in PDF format. Besides that, user can download the structured data in a json to use it in their own project.

Background of the concept

Structure of the software

Structuring and tagging the text in Python 3

Input

Using the recipe in the make file, the text in ocr/output.txt is sent to the script. The script reads the standard in, and will tokenize the string using the function provided by NLTK. Now the words are part of a list.

Than, each word gets an ID derived from its index number in the list. This ID and the word will be added to a dictionary with the structure {ID:{word:'word'}}. By using an ID, I am able to store multiple instances of the same word in the dictionary. This is relevant for commonly used words like "the" and "is".

By using another dictionary as the value, I can store multiple pieces of information under each ID. In an earlier version I used the word as the key for the dictionary, but I ran into trouble because python dictionaries only support unique keys.

In the

# == INPUT AND TOKENIZE ==
# Define input, tokenize and safe tokens to dictionary. Use index as ID for each word.
input = stdin.read()
words = nltk.word_tokenize(input)
words_and_tags = {index : {'word':word} for index , word in enumerate(words)}
print(words_and_tags)

Tagging functions

The words go through multiple taggers: functions that label words for Part-of-Speech, sentiment, stopwords and named entities. An example of the stopword tagger can be found below. I choose a structure of separate functions for each tagger, because I wanted to practice with the use of functions. Besides that, I can now easily reuse these functions in other projects.

In every case the output is an individual word, or the list of tokenized words. The output is an individual tag, or a list of tags.

# === 3. Stopword tagger ===
# Labels words on being a keyword or a stopword, based on the list in the NLTK corpus
def stopword_tagger(word):

    stopWords = set(stopwords.words('english'))

    if word in stopWords:
        stopword_tag = 'stopword'
    else:
        stopword_tag = 'keyword'

    return stopword_tag

Adding tags to the dictionary

After defining the functions, I run a for loop for the value words in the dictionary I made in the first step. New labels made by the taggers, are added as values in the dictionary.

# Adding tags to words in dictionary, which will be exported as a json file
# {'item 0' : {'word' : word, 'tagger 1': value 1}}

for item, value in words_and_tags.items():
    word = words_and_tags[item]['word']
    # Add stopword tag
    stopword_tag = stopword_tagger(word)
    words_and_tags[item]['wordtype'] = stopword_tag

In the end, one entry in the dictionary looks like this:

"1": {"word": "salary", "POS": "noun", "sentiment": "neutral", "named entity": "no entity", "wordtype": "keyword"}

Generating a html interface using Jinja2

Exporting json and pdf

Choosing a license

https://stackoverflow.com/questions/8580223/using-python-module-on-lgpl-license-in-commercial-product