User:Kimberley
Vernacular Language Processing (SI16)
Transcription
Selection Process
"Annotation Compass"
"Cloverleaf"
Cloverleaf is a tool to navigate a set of texts. Through generated shortcuts, it is meant to interrupt the linearity of a text. The result is a collage of excerpts that aims to free unexpected reading paths. The tool can be used to stitch various voices together in a non-hierarchical manner, producing hybrid constructions where common points and divergences can co-exist.
In detail
For two consecutive texts in an indexed set of texts, let’s consider a ‘preceding text’ and its ‘succeeding text’. First, the function bridge() looks for the first identical word occurring in both texts (excluding stop words). Let’s name the position (index) of this word ‘i’ in the preceding text and ‘j’ in the succeeding text:
text, index 0: “Strawberries don’t grow tasty (i) in the Netherlands.”
text, index 1: “Pineapple is very tasty (j) with salt (i) and chilli powder.”
text, index 2: “Blocks of salt (j) distract cows (i).”
text, index 3: “There were many fields with cows (j) in this area.”
Result: “Strawberries don’t grow tasty with salt distract cows in this area.”
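The search for the first shared word can be sketched as follows. This is a minimal illustration, not the project's own code: the stop-word set is a small hand-picked subset (the project uses NLTK's English list), and `normalise()` and `first_shared_word()` are hypothetical names introduced here. Unlike the original function, this sketch normalises a word before checking it against the stop words.

```python
# Illustrative stop-word subset (assumption); the project uses NLTK's English list.
STOP_WORDS = {"in", "the", "is", "very", "with", "and", "of", "don't"}

def normalise(word):
    # Hypothetical normalisation: lowercase and strip surrounding punctuation.
    return word.lower().strip(".,;:!?\"'()")

def first_shared_word(preceding, succeeding, start=0):
    """Return (i, j) for the first non-stop word of `preceding`, searched from
    `start`, that also occurs in `succeeding`; return None if there is none."""
    for i in range(start, len(preceding)):
        word = normalise(preceding[i])
        if word in STOP_WORDS:
            continue
        for j, other in enumerate(succeeding):
            if word == normalise(other):
                return i, j
    return None

text_0 = "Strawberries don't grow tasty in the Netherlands.".split()
text_1 = "Pineapple is very tasty with salt and chilli powder.".split()
print(first_shared_word(text_0, text_1))  # (3, 3): "tasty" in both texts
```

Passing `start` lets the search resume from the ‘j’ mark a text received when it was in the succeeding position.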
Since every text, in a given set of at least four texts, will alternately take the ‘preceding’ and the ‘succeeding’ position, each text will hold a word indexed as ‘i’ and a word indexed as ‘j’, marking the identical words it shares with its succeeding and its preceding text respectively. The exceptions are the first text, which starts at its first word, and the last text, which runs to its end. These marks then determine the start and the end of each excerpt, and open the aforementioned ‘shortcut’.
As a result, the preceding text will be printed from its index (j), attributed earlier when this text was in the ‘succeeding’ position, until (i), its common word with its current succeeding text. The function loops until the last two texts of the set (in index order).
TEXT 1 xxxxxxxxx(J)oooooooooooo(I)xxxxxxxxxxxxx
TEXT 2 xxxxxxxxxxxxxxxxx(J)oooooooooooo(I)xxxxxxxxxx
TEXT 3 xxxxxxxxxxx(J)ooooooo(I)xxxxxxxxxxxxxxxx
TEXT 4 xxxxxx(J)oooooooooooooooo(I)xxxxxxxxxxxxxx
o = printed text
x = rejected text
J = index of the word shared with the preceding text
I = index of the word shared with the succeeding text
If no match is found between two texts, the preceding text is printed from its start index to its last word, and the succeeding text starts from its first word.
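The whole procedure, including the no-match case, can be condensed into one short sketch. It is an assumption-laden illustration rather than the project's implementation: `stitch()` and `normalise()` are hypothetical names, the stop-word set is a hand-picked subset, and the excerpt is taken up to but not including the ‘i’ word, which is what the example result above suggests.

```python
# Illustrative stop-word subset (assumption); the project uses NLTK's English list.
STOP_WORDS = {"in", "the", "is", "very", "with", "and", "of", "this", "don't"}

def normalise(word):
    # Hypothetical normalisation: lowercase and strip surrounding punctuation.
    return word.lower().strip(".,;:!?\"'()")

def stitch(texts):
    """Join excerpts: each text contributes the words from its J mark
    up to (not including) its I mark."""
    if not texts:
        return ""
    pieces = []
    start = 0  # J mark of the current preceding text (0 for the first text)
    for index in range(len(texts) - 1):
        preceding, succeeding = texts[index], texts[index + 1]
        # Map each normalised word of the succeeding text to its first position.
        targets = {}
        for j, word in enumerate(succeeding):
            targets.setdefault(normalise(word), j)
        # Defaults for the no-match case: keep the rest of the preceding text,
        # and start the succeeding text from its first word.
        end, next_start = len(preceding), 0
        for i in range(start, len(preceding)):
            word = normalise(preceding[i])
            if word not in STOP_WORDS and word in targets:
                end, next_start = i, targets[word]
                break
        pieces.extend(preceding[start:end])
        start = next_start
    pieces.extend(texts[-1][start:])  # the last text runs to its end
    return " ".join(pieces)

texts = [
    "Strawberries don't grow tasty in the Netherlands.".split(),
    "Pineapple is very tasty with salt and chilli powder.".split(),
    "Blocks of salt distract cows.".split(),
    "There were many fields with cows in this area.".split(),
]
print(stitch(texts))
# Strawberries don't grow tasty with salt distract cows in this area.
```

When two texts share no word, the sketch simply concatenates them, for example `stitch([["red", "apple"], ["green", "pear"]])` gives "red apple green pear".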
"Cloverleaf" and the "Annotation Compass"
Cloverleaf was imagined to navigate the annotations gathered with the Annotation Compass, and offers a way to process its ever-growing data (a JSON file). Used together, the two tools become a proliferative environment for collective writing.
Collecting the JSON file
from urllib.request import urlopen
import json

from nltk.corpus import stopwords

# English stop words; requires the NLTK 'stopwords' corpus (nltk.download("stopwords"))
sw = stopwords.words("english")

resultSentences = []
labels_corpus = []

# Fetch the labels collected with the Annotation Compass for one image
url = "https://hub.xpub.nl/soupboat/generic-labels/get-labels/?image=think-classify7.jpg"
response = urlopen(url)
data_json = json.loads(response.read())
labels = data_json['labels']

# Split the text of each label into a list of words
for label in labels:
    sent = label['text'].split()
    labels_corpus.append(sent)

print(labels_corpus)
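To try the same step without a network connection, the parsing can be run on a small stand-in payload. Only the 'labels' and 'text' keys are taken from the code above; the sample annotations and the `labels_to_corpus()` helper are invented for illustration.

```python
import json

# Hypothetical sample payload: only the 'labels' and 'text' keys
# come from the code above, the annotation texts are made up.
sample_response = '{"labels": [{"text": "to put things in a row"}, {"text": "sort sort sort"}]}'

def labels_to_corpus(raw_json):
    """Turn the JSON payload into a list of texts, each a list of words."""
    data = json.loads(raw_json)
    return [label["text"].split() for label in data["labels"]]

corpus = labels_to_corpus(sample_response)
print(corpus)
# [['to', 'put', 'things', 'in', 'a', 'row'], ['sort', 'sort', 'sort']]
```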
In a first experiment, a group of eight people was given the following instruction:
"For as long as one minute; you are invited to define, in your own words and logic, the verb on the screen. This definition does not necessarily have to have sense for anyone else but you, although English language will be our common ground in this experiment.
Important: You are required to write for the entire span of this granted minute! Do not lift hands from keyboard, and in case of blockage you are welcome to press any key or write any word, onomatopoeia, etc.
For this experiment, each participant will only fill in one ‘insert’ box per word to be defined."
The terms to be defined were taken directly from Georges Perec's "Vocabulary exercises" ("Think/Classify", 1985):
- "arrange,
- catalogue,
- cut up,
- divide,
- enumerate,
- gather,
- grade,
- group,
- list,
- number,
- order,
- organise,
- sort"
Bridge()
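The bridge() function calls a helper named clean_word(), whose definition is not shown in this section. A minimal stand-in, assuming that matching should ignore letter case and surrounding punctuation, could look like this; the actual helper may behave differently.

```python
import string

def clean_word(word):
    # Assumption: two words match when they are equal after lowercasing
    # and removing leading/trailing punctuation (including curly quotes).
    return word.lower().strip(string.punctuation + "“”‘’")

print(clean_word("cows."))  # cows
```

With this stand-in, "cows." at the end of one sentence matches "cows" in the middle of another, as in the example above.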
# bridge() takes two texts (text_a and text_b, as lists of words), the index at which
# the excerpt of text_a starts (start_a), and a flag telling whether this is the last pair (isLast).
def bridge(text_a, text_b, start_a, isLast):
    matchFound = 0
    start_next = 0
    # for index i in text_a, from the given start index until the end of text_a
    for i in range(start_a, len(text_a)):
        if matchFound:
            break
        # word_a is the word at index i in text_a
        word_a = text_a[i]
        # if word_a is not in the given list of stop words:
        if word_a not in sw:
            # for index j in the entire text_b:
            for j in range(0, len(text_b)):
                # word_b is the word at index j in text_b
                word_b = text_b[j]
                # if word_a equals word_b (once both are cleaned):
                if clean_word(word_a) == clean_word(word_b):
                    # resultSentences collects one entry per excerpt:
                    resultSentences.append({
                        'text': text_a,
                        'start': start_a,
                        'end': i,
                        'hasMatch': 1
                    })
                    # if text_a and text_b are the last pair to be compared,
                    # text_b is added as well, starting at j and with no end index.
                    if isLast:
                        resultSentences.append({
                            'text': text_b,
                            'start': j,
                            'end': None,
                            'hasMatch': 1
                        })
                    # once a match is found, remember its index in text_b and stop searching
                    matchFound = 1
                    start_next = j
                    break
    # no match: text_a is kept from start_a to its end, and text_b will start from its first word
    if matchFound == 0:
        resultSentences.append({
            'text': text_a,
            'start': start_a,
            'end': None,
            'hasMatch': 0
        })
        if isLast:
            resultSentences.append({
                'text': text_b,
                'start': 0,
                'end': None,
                'hasMatch': 0
            })
    # the function returns the index of the 'same word' in text_b
    return start_next
def bridge_list(corpus):
    start_a = 0
    result = ""
    # for every text index within the corpus to be compared:
    for text_index in range(0, len(corpus)-1):
        # the last text_a to be compared has to be the text indexed as corpus[-2];
        # the last text_b will then be the last text of the corpus (corpus[-1]).
        isLast = text_index == len(corpus)-2
        # text_a is the text at the given index of the corpus and text_b is the one that follows
        text_a = corpus[text_index]
        text_b = corpus[text_index + 1]
        # bridge() returns the index (in text_b) of the first 'common word' between text_a and text_b;
        # it becomes start_a for the next iteration, when text_b takes the text_a position.
        start_next = bridge(text_a, text_b, start_a, isLast)
        start_a = start_next