User:Tash/Special Issue 05

Reflections on Book Scanning: A Feminist Reader

Images: test prints on different paper; test print on pink paper; test binding with hardcover and 200gr cover paper; creating the hardcover with book linen; the final reader.

Key questions:

  • How to use feminist methodologies as a tool to unravel the known / unknown history of book scanning?
  • What is lost and what is gained when we scan / digitize a book?
  • What culture do we reproduce when we scan?

The last question became the focus of my individual research this trimester. Both my reader and my script are critiques of the way human bias can multiply from medium to medium, especially when the power structures that regulate them remain unchallenged and when knowledge spaces present themselves as universal or immediate.

My research

Key topics:

  • How existing online libraries like Google Books select, structure and use literary works
  • The human bias in all technological processes, algorithms and media
  • Gender and power structures in language, and especially the written word
  • Women's underrepresentation in science, technology and literature
  • The politics of transparency and seamlessness of digital interfaces
  • The male-dominated canon and Western-centric systems of knowledge production and regulation

Being a writer myself, I wanted to explore a feminist critique of the literary canon - of who is included and excluded when it comes to the written world. The documentary on Google Books and the World Brain also sparked questions about who controls these processes, and to what end. Technology is not neutral, even less so than science, as it is primarily concerned with the creation of artefacts. In book scanning (largely seen as the ultimate means of compiling the entirety of human knowledge), it is still people who write the code, select the books to scan and design the interfaces to access them. To separate this labour from its results is to overlook much of the social and political aspect of knowledge production.

As such, my reader questions how human biases and cultural blind spots are transferred from the page to the screen, as companies like Google turn books into databases, bags of words into training sets, and use them in ways we don't all know about. The conclusion is that if we want to build more inclusive and unbiased knowledge spaces, we have to be more critical of the politics of selection, and, as Johanna Drucker said, "call attention to the made-ness of knowledge."

Final list of works included

On the books we upload

  1. The Book: Its Past, Its Future: An Interview with Roger Chartier
  2. Webs of Feminist Knowledge Online by Sanne Koevoets

On the canon which excludes

  1. Feminist Challenges to the Literary Canon by Lillian Robinson
  2. I am a Woman Writer, I am a Western Writer: An Interview with Ursula Le Guin
  3. Merekam Perempuan Penulis Dalam Sejarah Kesusastraan: Wawancara dengan Melani Budianta
  4. Linguistic Sexism and Feminist Linguistic Activism by Anne Pauwels

On what the surface hides

  1. Windows and Mirrors: The Myth of Transparency by Jay Bolter and Diane Gromala
  2. Performative Reality and Theoretical Approaches to Interface by Johanna Drucker
  3. On Being Included by Sara Ahmed

To see my Zotero library: click here

Design & Production

The design of my reader was inspired by the following feminist methodologies:

  • Situated knowledges

The format we chose for the whole reader (each of us making our own unique chapter) reflects the idea that knowledge is inextricable from its context: its author, their worldview, their intentions. This is also the reason why I chose to include a small biography of myself, and to weave my own personal views and annotations throughout the content of my reader.

  • Performative materiality

The diverse formats, materials and designs of all of our readers also mean that the scanning process will never be the same twice. It becomes more performative, as decisions have to be made: which reader to scan first? In which direction (as some text is laid out at different angles)? What will be left out, and what will be kept? Again we ask the audience to pay more attention to who is scanning and how things are being scanned.

  • Intersectional feminism

My reader also includes an article written in Indonesian. As an Indonesian artist I am always aware of how Western my education is and has been. I wanted to comment on the fact that a huge percentage of the books that have been scanned today are of Anglo-American origin.

  • Diversity in works

9 out of the 13 authors/interview subjects in my reader are women. I learnt how important and revealing citation lists can be.

As my subject was the literary canon, I decided to design my reader as a traditional hardcover book. I set up the layout following Jan Tschichold's rules of style. Within these typical forms, I decided to make untypical choices, like the use of pink paper and setting all of my annotations at a 90-degree angle. The graphic image on the dust cover was designed to evoke the mutation of media, from page to screen and back.

PDF: File:ReaderNB Final Spreads.pdf


Software

  • Following on from my research for my reader, my central question became how to visualise / play with / emphasize the way cultural biases or blind spots are multiplied from medium to medium
  • Other concepts include the echo chamber of the internet, inclusion and visibility of minorities, the coded gaze, and how design can challenge or perpetuate dominant narratives
  • Important refs:
    • "If the content of the books we scan are exclusive and incomplete, how can we ensure that they are at least distributed and treated as such?" - from my reading of Johanna Drucker's Performative Materiality
    • "Feminist and race theorists over generations have taught us that to inhabit a category of privilege is not to come up against the category... When a category allows us to pass into the world, we might not notice that we inhabit that category. When we are stopped or held up by how we inhabit what we inhabit, then the terms of habitation are revealed to us." - Sara Ahmed, On Being Included (2012)
    • "The past interrupts the present." - Grada Kilomba
    • "Calls for tech inclusion often miss the bias that is embedded in written code. Frustrating experiences with using computer vision code on diverse faces remind me that not all eyes or skin tones are easily recognized with existing code libraries." - Joy Buolamwini (https://medium.com/mit-media-lab/incoding-in-the-beginning-4e2a5c51a45d)


Tests & experiments

Session with Manetta & Cristina on training supervised classifiers: positive vs negative and rational vs emotional; binaries, data sets and protocols. Script:

import nltk
import random
import pickle

input_a = 'input/emotional.txt' 
input_b = 'input/rational.txt'

documents = []
all_words = []

def read_input_text(filename, category):
        txtfile = open(filename, 'r')
        string = txtfile.read()
        sentences = nltk.sent_tokenize(string)
        vocabulary = []
        for sentence in sentences:
                words = nltk.word_tokenize(sentence)
                vocabulary.append(words)
                documents.append((words, category))
                for word in words:
                        all_words.append(word.lower())
        return vocabulary

vocabulary_a = read_input_text(input_a, 'emotional')
vocabulary_b = read_input_text(input_b, 'rational')

print('Data size:', len(vocabulary_a))
print('Data size:', len(vocabulary_b))

baseline_a = len(vocabulary_a) / (len(vocabulary_a) + len(vocabulary_b)) * 100
baseline_b = len(vocabulary_b) / (len(vocabulary_a) + len(vocabulary_b)) * 100
print('Baseline: '+str(baseline_a)+'% / '+str(baseline_b)+'%')

random.shuffle(documents)

words = []
for w in all_words:
    w = w.lower()
    words.append(w)

most_freq = nltk.FreqDist(words).most_common(100)
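# most_freq is a list of (word, count) pairs, e.g. [('the', 120), ('and', 85), ...] (counts shown are illustrative)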

def create_document_features(document): # document=list of words coming from the two input txt files, which have been reshuffled
        if type(document) == str:
                document = nltk.word_tokenize(document)
                
        document_words = set(document) # set = list of unique words
        features = {}

        # add word-count features
        for word, num in most_freq:
                count = document.count(word)
                features['contains({})'.format(word)] = count
                # features['contains({})'.format(word)] = (word in document_words)

        # # add Part-of-Speech features
        # selected_tag_set = ['JJ', 'CC', 'IN', 'DT', 'TO', 'NN', 'AT','RB', 'PRP', 'VB', 'NNP', 'VBZ', 'VBN', '.'] #the feature set
        # tags = []
        # document_pos_items = nltk.pos_tag(document)
        # if document_pos_items:
        #         for word, t in document_pos_items:
        #                 tags.append(t)
        # for tag in selected_tag_set:
        #         features['pos({})'.format(tag)] = tags.count(tag)

        print('\n')
        print(document)
        print(features)
        return features

# *** TRAIN & TEST ***
featuresets = []
for (document, category) in documents:
        features = create_document_features(document)
        featuresets.append((features, category))

# training
train_num = int(len(featuresets) * 0.8) # 80% of your data
test_num = int(len(featuresets) * 0.2)
train_set, test_set = featuresets[:train_num], featuresets[train_num:] # slice the shuffled list: first 80% for training, remaining 20% for testing

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(create_document_features('Language is easy to capture but difficult to read.')))

print(nltk.classify.accuracy(classifier, test_set)) # http://www.nltk.org/book/ch06.html#accuracy
# shows how accurate your classifier is, based on the test results
# 0.7666666666666667
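# optional: inspect which word features the classifier weighs most heavily
# classifier.show_most_informative_features(10)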

# save classifier as .pickle file
# f = open('my_classifier.pickle', 'wb')
# pickle.dump(classifier, f)
# f.close()
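
A saved classifier could later be reloaded and reused along these lines (a minimal sketch, assuming the pickle lines above were uncommented and that the same create_document_features function and its most_freq list are available):

import pickle

# load the previously trained classifier from disk
with open('my_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)

# classify a new sentence with the same feature extraction used during training
print(classifier.classify(create_document_features('Language is easy to capture but difficult to read.')))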

Using NLTK to analyse texts and process them, looking at how one scan could affect what was visible in the next

import nltk
from nltk.tokenize import RegexpTokenizer 
from nltk import FreqDist
from nltk.corpus import stopwords
import random

stopwords.words('english')
sr = set(stopwords.words('english'))

##inputting first text file which we want to analyse
text_file = open("input/ocr/001.txt").read()

##tokenize only alphanumeric sequences i.e. ignore punctuation & everything else
tokenizer = RegexpTokenizer(r'\w+')
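# e.g. RegexpTokenizer(r'\w+').tokenize("Don't scan this!") returns ['Don', 't', 'scan', 'this']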
allwords = tokenizer.tokenize(text_file)
clean_words = allwords[:]
 
for word in allwords:
    if word.lower() in sr:
        clean_words.remove(word)

#print ("Clean words without Stopwords or punctuation:", clean_words)

fdist = FreqDist(clean_words)
mostcommon = fdist.most_common(30)
mostcommon_list = [i[0] for i in mostcommon]
print ("Most common words from text 1:", mostcommon_list)

#-------------------------------------------------------------------------------#
##analysing second text file which we want to edit
text_file2 = open("input/ocr/002.txt").read()

##tokenize only alphanumeric sequences i.e. ignore punctuation & everything else
tokenizer = RegexpTokenizer(r'\w+')
allwords2 = tokenizer.tokenize(text_file2)
clean_words2 = allwords2[:]

for word2 in allwords2:
    if word2.lower() in sr:
        clean_words2.remove(word2)

#print ("Clean words without Stopwords or punctuation:", clean_words)

fdist = FreqDist(clean_words2)
leastcommon = fdist.most_common()
leastcommon_list = []

for i in leastcommon:
	if (i[1] == 1):
		leastcommon_list.append(i[0])

print ("Least common words in text file 2", leastcommon_list)


#-------------------------------------------------------------------------------#
##replace least common words from second text file with most common words from first text file
#Empty list which will be used for output:
newtext = []

text2 = text_file2.split()
for x in text2:
	if (x in leastcommon_list):
		#r = (random.choice(mostcommon_list)) 
		newtext.append('-')

	else:
		newtext.append(x)
print ("New text:", " ".join(newtext))


Using html5lib to parse hOCR files and generate new outputs from the original scanned image (disrupting the seamless process!)

Images: example from the Guttormsgaard archive using html5lib; censoring the word 'the' on a page of my reader.
import html5lib
from xml.etree import ElementTree as ET 
from PIL import Image
import nltk
from nltk import word_tokenize

iim = Image.open("burroughs-000.tiff")
oim = Image.new("RGB", iim.size, (255, 255, 255))
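# oim starts as a blank white canvas the same size as the scanned input image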

f = open('burroughs.html')
# t is an "element tree"
t = html5lib.parse(f, namespaceHTMLElements=False)
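# namespaceHTMLElements=False keeps tag names plain (no XML namespace prefix), so the findall pattern below can match 'span' directly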
for s in t.findall(".//span[@class='ocrx_word']"):
	print (ET.tostring(s, encoding="unicode"))
	word = s.text

	# here you are extracting the 'title' attribute: the box coordinates of each word
	r = s.attrib['title']
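	# the title value typically looks like 'bbox 230 90 400 130; x_wconf 95' (example values are illustrative)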
	# split the attribute into sections, discard what isn't useful
	r, c = r.split(";")
	r = r.split(" ")[1:]

	# put into list as integers
	r = [int(x) for x in r]

	# use PIL to crop out every box, then paste it according to if rule
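	# words longer than 5 characters are pasted over with white (censored); shorter words are pasted back in place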
	wim = iim.crop(r)

	if len(word) > 5:
		oim.paste((255, 255, 255), (r[0], r[1], r[2], r[3]))
	
	else:
		oim.paste(wim, (r[0], r[1], r[2], r[3]))

wim.save ("wim.png")
oim.save("output-burroughs1.pdf")

Developing the replace script, which is more complicated because it has to relate to 3 files at the same time: the hocr of the page, the initial image of the page, and the cropped word image

Images: deleting least common words; replacing 'you' with 'we'.
import html5lib
from xml.etree import ElementTree as ET 
from PIL import Image
import nltk 
from nltk.tokenize import RegexpTokenizer 
from nltk import FreqDist
from nltk.corpus import stopwords
import random 
import glob
import time

stopwords.words('english')
sr = set(stopwords.words('english'))

def cleanstopwords(wordlist):
	"This cleans stopwords from a list of words"
	clean_words = wordlist[:]
	for word in wordlist:
		if word.lower() in sr:
			clean_words.remove(word)
	return clean_words

def findmostcommon(wordlist):
	"This finds the most common words and returns a list"
	fdist = FreqDist(word.lower() for word in wordlist)
	mostcommon = fdist.most_common(30)
	mostcommon_list = [i[0] for i in mostcommon]
	return mostcommon_list

def coordinates(attribute):
	"This extracts the box coordinates of words from an hocr / html element tree"
	r = attribute 	# 'title' is the word in the html tag
	r, c = r.split(";") 	# split the attribute into two sections
	r = r.split(" ")[1:] 	#split, discard the elements which aren't useful
	r = [int(x) for x in r] # put coordinates into list as integers
	return r

def filternone(word_raw):
	"This replaces missing (None) words with a placeholder and strips punctuation"
	if word_raw is None:
		word = 'null'
	else:
		word = word_raw.strip(',".!:;')
	return word

#-------------------------------------------------------------------------------#
#inputting first OCR text file which we want to analyse

text_file1 = open("input/ocr/001.txt").read()
print ('1. Processing first scanned image. Learning vocabulary.')

#tokenize only alphanumeric sequences and clean stopwords
tokenizer = RegexpTokenizer(r'\w+')
allwords = tokenizer.tokenize(text_file1)
clean_words = cleanstopwords(allwords)

#find most common words
mostcommon_list = findmostcommon(clean_words)
print ("The most common words in text 1 are:", mostcommon_list)
print ("")
time.sleep(1)

#-------------------------------------------------------------------------------#
#analysing second text file which we want to edit
text_file2 = open("input/ocr/002.txt").read()

#tokenize only alphanumeric sequences and clean stopwords
tokenizer = RegexpTokenizer(r'\w+')
allwords2 = tokenizer.tokenize(text_file2)
clean_words2 = cleanstopwords(allwords2) 

#find least common words
fdist = FreqDist(word.lower() for word in clean_words2)
leastcommon = fdist.most_common()
leastcommon_list = []

for i in leastcommon:
	if (i[1] == 1):
		leastcommon_list.append(i[0])

print ("2. Processing second scanned image.") 
print ("The least common words in text 2 are:", leastcommon_list)

#-------------------------------------------------------------------------------#
#create output images (oim) using initial image (iim) and word image (wim)
print ('3. Extracting coordinates of words.')

n = 0
iim1 = Image.open("img-001.tiff")
oim1 = Image.new("RGB", iim1.size, (255, 255, 255))
a = open("img-001.html")

# t is an "element tree"
# collect cropped images of the most common words into the mostcommonimg folder
t1 = html5lib.parse(a, namespaceHTMLElements=False)
for s in t1.findall(".//span[@class='ocrx_word']"):
	n = n+1
	word = filternone(s.text)

	#extract coordinates
	r = coordinates(s.attrib['title'])

	if word in mostcommon_list:
		r_replace = r
		wimreplace = iim1.crop(r_replace)
		wimreplace.save("output/mostcommonimg/wimreplace{}.png".format(n))

#-------------------------------------------------------------------------------#
# processing output images

iim2 = Image.open("img-002.tiff")
oim2 = Image.new("RGB", iim2.size, (255, 255, 255))
b = open("img-002.html")

print ('4. Reading second scanned image, filtering least common words.')

# t is an "element tree"
# walk the words of the second page and replace the least common ones
t2 = html5lib.parse(b, namespaceHTMLElements=False)
for s in t2.findall(".//span[@class='ocrx_word']"):
	word = filternone(s.text)

	#extract coordinates
	r = coordinates(s.attrib['title'])

	# use PIL to crop out every box, then paste it according to if rule
	wim = iim2.crop(r)
	# open a random most-common-word image and tint it yellow
	wimreplace = Image.open(random.choice(glob.glob('./output/mostcommonimg/*.png'))).convert('RGBA')
	wimcolor = Image.new('RGBA', wimreplace.size, (255, 255, 0, 1))
	out = Image.alpha_composite(wimreplace, wimcolor)

	if word.lower() in leastcommon_list:
		oim2.paste(out, (r[0], r[1]))
	else:
		oim2.paste(wim, (r[0], r[1], r[2], r[3]))

#-------------------------------------------------------------------------------#
# save output images
oim1.save("output/scanimg/output-replace1.png")
oim2.save("output/scanimg/output-replace2.png") # oim2 holds the processed second page; output path assumed to follow the same pattern