Reflections on Book Scanning: A Feminist Reader
Key questions:
- How to use feminist methodologies as tool to unravel the known / unknown history of book scanning?
- What is lost and what is gained when we scan / digitize a book?
- What culture do we reproduce when we scan?
The last question became the focus of my individual research this trimester. Both my reader and my script are critiques of the way human bias can multiply from medium to medium, especially when the power structures that regulate them remain unchallenged and when knowledge spaces present themselves as universal or immediate.
My research
Key topics:
- How existing online libraries like Google Books select, structure and use literary works
- The human bias in all technological processes, algorithms and media
- Gender and power structures in language, and especially the written word
- Women's underrepresentation in science, technology and literature
- The politics of transparency and seamlessness of digital interfaces
- The male-dominated canon and Western-centric systems of knowledge production and regulation
Being a writer myself, I wanted to explore a feminist critique of the literary canon - of who is included and excluded when it comes to the written world. The documentary on Google Books and the World Brain also sparked questions about who is controlling these processes, and to what end? Technology is not neutral, even less so than science is, as it is primarily concerned with the creation of artefacts. In book scanning (largely seen as the ultimate means of compiling the entirety of human knowledge) it is still people who write the code, select the books to scan and design the interfaces to access them. To separate this labour from its results is to overlook much of the social and political aspects of knowledge production.
As such, my reader questions how human biases and cultural blind spots are transferred from the page to the screen, as companies like Google turn books into databases and bags of words into training sets, and use them in ways we don't all know about. The conclusion is that if we want to build more inclusive and unbiased knowledge spaces, we have to be more critical of the politics of selection, and as Johanna Drucker said, "call attention to the made-ness of knowledge."
Final list of works included
On the books we upload
- The Book: Its Past, Its Future: An Interview with Roger Chartier
- Webs of Feminist Knowledge Online by Sanne Koevoets
On the canon which excludes
- Feminist Challenges to the Literary Canon by Lillian Robinson
- I am a Woman Writer, I am a Western Writer: An Interview with Ursula Le Guin
- Merekam Perempuan Penulis Dalam Sejarah Kesusastraan: Wawancara dengan Melani Budianta (Recording Women Writers in Literary History: An Interview with Melani Budianta)
- Linguistic Sexism and Feminist Linguistic Activism by Anne Pauwels
On what the surface hides
- Windows and Mirrors: The Myth of Transparency by Jay Bolter and Diane Gromala
- Performative Reality and Theoretical Approaches to Interface by Johanna Drucker
- On Being Included by Sara Ahmed
To see my Zotero library: click here
Design & Production
The design of my reader was inspired by the following feminist methodologies:
- Situated knowledges
The format we chose for the whole reader (each of us making our own unique chapter) reflects the idea that knowledge is inextricable from its context: its author, their worldview, their intentions. This is also the reason why I chose to include a small biography of myself, and to weave my own personal views and annotations throughout the content of my reader.
- Performative materiality
The diverse formats, materials and designs of all of our readers also mean that the scanning process will never be the same twice. It becomes more performative, as decisions have to be made: which reader to scan first? In which direction (as some text is laid out at different angles)? What will be left out, and what will be kept? Again we ask the audience to pay more attention to who is scanning and how things are being scanned.
- Intersectional feminism
My reader also includes an article written in Indonesian. As an Indonesian artist I am always aware of how Western my education is and has been. I wanted to comment on the fact that a huge percentage of the books that have been scanned today are of Anglo-American origin.
- Diversity in works
9 out of the 13 authors/interview subjects in my reader are women. I learnt how important and revealing citation lists can be.
As my subject was the literary canon, I decided to design my reader as a traditional hardcover book. I set up the layout following Jan Tschichold's rules of style. Within these typical forms, I decided to make untypical choices, like the use of pink paper and setting all of my annotations at a 90-degree angle. The graphic image on the dust cover was designed to evoke the mutation of media, from page to screen and back.
PDF: File:ReaderNB Final Spreads.pdf
Software
- Following on from my research for my reader, my central question became how to visualise / play with / emphasize the way cultural biases or blind spots are multiplied from medium to medium
- Other concepts include the echo chamber of the internet, the inclusion and visibility of minorities, the coded gaze, and how design can challenge or perpetuate dominant narratives
- Important refs:
- "If the content of the books we scan are exclusive and incomplete, how can we ensure that they are at least distributed and treated as such?" - from my reading of Johanna Drucker's Performative Materiality
- "Feminist and race theorists over generations have taught us that to inhabit a category of privilege is not to come up against the category... When a category allows us to pass into the world, we might not notice that we inhabit that category. When we are stopped or held up by how we inhabit what we inhabit, then the terms of habitation are revealed to us." - Sara Ahmed, On Being Included (2012)
- "The past interrupts the present." - Grada Kilomba
- "Calls for tech inclusion often miss the bias that is embedded in written code. Frustrating experiences with using computer vision code on diverse faces remind me that not all eyes or skin tones are easily recognized with existing code libraries." - Joy Buolamwini (https://medium.com/mit-media-lab/incoding-in-the-beginning-4e2a5c51a45d)
Tests & experiments
Session with Manetta & Cristina on training supervised classifiers: positive vs negative and rational vs emotional, binaries, data sets and protocols
Pad with script: https://pad.pzimediadesign.nl/p/OuNuPo-3
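A minimal sketch of the kind of supervised classification we tried in this session, using NLTK's Naive Bayes classifier. The example sentences, the rational / emotional labels and the bag-of-words features below are placeholders of my own, not the actual data set from the pad:
import nltk

# toy labelled data standing in for the rational vs emotional training sets
train = [
    ("the results were measured and verified", "rational"),
    ("the method follows a strict protocol", "rational"),
    ("i love the smell of old books", "emotional"),
    ("this page makes me furious", "emotional"),
]

def features(sentence):
    # bag-of-words features: which words appear in the sentence
    return {word: True for word in sentence.lower().split()}

train_set = [(features(s), label) for (s, label) in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("i adore this chapter")))
# the classifier can only reproduce the binaries and biases of its training set
classifier.show_most_informative_features(5)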
Using NLTK to analyse texts and process them, looking at how one scan could affect what was visible in the next
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk import FreqDist
from nltk.corpus import stopwords
import random
stopwords.words('english')
sr = set(stopwords.words('english'))
##inputting first text file which we want to analyse
text_file = open("input/ocr/001.txt").read()
##tokenize only alphanumeric sequences i.e. ignore punctuation & everything else
tokenizer = RegexpTokenizer(r'\w+')
allwords = tokenizer.tokenize(text_file)
clean_words = allwords[:]
for word in allwords:
    if word in sr:
        clean_words.remove(word)
#print ("Clean words without Stopwords or punctuation:", clean_words)
fdist = FreqDist(clean_words)
mostcommon = fdist.most_common(30)
mostcommon_list = [i[0] for i in mostcommon]
print ("Most common words from text 1:", mostcommon_list)
#-------------------------------------------------------------------------------#
##analysing second text file which we want to edit
text_file2 = open("input/ocr/002.txt").read()
##tokenize only alphanumeric sequences i.e. ignore punctuation & everything else
tokenizer = RegexpTokenizer(r'\w+')
allwords2 = tokenizer.tokenize(text_file2)
clean_words2 = allwords2[:]
for word2 in allwords2:
    if word2 in sr:
        clean_words2.remove(word2)
#print ("Clean words without Stopwords or punctuation:", clean_words)
fdist = FreqDist(clean_words2)
leastcommon = fdist.most_common()
leastcommon_list = []
for i in leastcommon:
    if (i[1] == 1):
        leastcommon_list.append(i[0])
print ("Least common words in text file 2", leastcommon_list)
#-------------------------------------------------------------------------------#
##replace least common words from second text file with most common words from first text file
#Empty list which will be used for output:
newtext = []
text2 = text_file2.split()
for x in text2:
    if (x in leastcommon_list):
        #r = (random.choice(mostcommon_list))
        newtext.append('-')
    else:
        newtext.append(x)
print ("New text:", " ".join(newtext))
Using html5lib to parse the hOCR files and PIL to crop the original scanned image, making new outputs from both (disrupting the seamless process!)
import html5lib
from xml.etree import ElementTree as ET
from PIL import Image
import nltk
from nltk import word_tokenize
iim = Image.open("burroughs-000.tiff")
oim = Image.new("RGB", iim.size, (255, 255, 255))
f = open('burroughs.html')
# t is an "element tree"
t = html5lib.parse(f, namespaceHTMLElements=False)
for s in t.findall(".//span[@class='ocrx_word']"):
    print (ET.tostring(s, encoding="unicode"))
    word = s.text
    # here you are extracting the attribute: the box coordinates of each word
    r = s.attrib['title']
    # split the attribute into sections, discard what isn't useful
    r, c = r.split(";")
    r = r.split(" ")[1:]
    # put into list as integers
    r = [int(x) for x in r]
    # use PIL to crop out every box, then paste it according to if rule
    wim = iim.crop(r)
    if len(word) > 5:
        oim.paste((255, 255, 255), (r[0], r[1], r[2], r[3]))
    else:
        oim.paste(wim, (r[0], r[1], r[2], r[3]))
    wim.save("wim.png")

oim.save("output-burroughs1.pdf")
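For reference, each ocrx_word span in Tesseract's hOCR output stores its box coordinates in the title attribute, which is what the splitting above unpacks. A small standalone sketch of just that step (the sample string is made up, but follows the usual bbox + confidence format):
# a typical 'title' attribute of an ocrx_word span: bounding box plus word confidence
title = "bbox 347 183 520 225; x_wconf 91"

box, conf = title.split(";")       # separate the bbox part from the confidence part
coords = box.split(" ")[1:]        # drop the 'bbox' keyword, keep the four numbers
coords = [int(x) for x in coords]  # left, top, right, bottom as integers

print(coords)  # [347, 183, 520, 225] - usable directly as a PIL crop box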
Developing the replace script, which is more complicated because it has to relate to three files at the same time: the hOCR file of the page, the initial image of the page, and the cropped word images
import html5lib
from xml.etree import ElementTree as ET
from PIL import Image
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk import FreqDist
from nltk.corpus import stopwords
import random
import glob
stopwords.words('english')
sr = set(stopwords.words('english'))
def cleanstopwords(list):
    "This cleans stopwords from a list of words"
    clean_words = list[:]
    for word in list:
        if word.lower() in sr:
            clean_words.remove(word)
    return clean_words
def findmostcommon(list):
    "This finds the most common words and returns a list"
    fdist = FreqDist(word.lower() for word in list)
    mostcommon = fdist.most_common(30)
    mostcommon_list = [i[0] for i in mostcommon]
    return mostcommon_list
def coordinates(attribute):
    "Extracts the box coordinates of words from an html element tree"
    r = attribute # the 'title' attribute holds the box coordinates of the word
    r, c = r.split(";") # split the attribute into two sections
    r = r.split(" ")[1:] # split, discard the elements which aren't useful
    r = [int(x) for x in r] # put coordinates into list as integers
    return r
def filternone(word_raw):
    "Replaces empty OCR words so the script doesn't crash on them"
    if word_raw is None:
        word = 'null'
    else:
        word = word_raw.strip(',".!:;')
    return word
#-------------------------------------------------------------------------------#
#inputting first OCR text file which we want to analyse
text_file1 = open("input/ocr/001.txt").read()
print ('1. Processing first scanned image. Learning vocabulary.')
#tokenize only alphanumeric sequences and clean stopwords
tokenizer = RegexpTokenizer(r'\w+')
allwords = tokenizer.tokenize(text_file1)
clean_words = cleanstopwords(allwords)
#find most common words
mostcommon_list = findmostcommon(clean_words)
print ("The most common words in text 1 are:", mostcommon_list)
#-------------------------------------------------------------------------------#
#analysing second text file which we want to edit
text_file2 = open("input/ocr/002.txt").read()
#tokenize only alphanumeric sequences and clean stopwords
tokenizer = RegexpTokenizer(r'\w+')
allwords2 = tokenizer.tokenize(text_file2)
clean_words2 = cleanstopwords(allwords2)
#find least common words
fdist = FreqDist(word.lower() for word in clean_words2)
leastcommon = fdist.most_common()
leastcommon_list = []
for i in leastcommon:
    if (i[1] == 1):
        leastcommon_list.append(i[0])
print ("2. Processing second scanned image.")
print ("The least common words in text 2 are:", leastcommon_list)
#-------------------------------------------------------------------------------#
#create output images (oim) using initial image (iim) and word image (wim)
print ('3. Extracting coordinates of words.')
n = 0
iim1 = Image.open("img-001.tiff")
oim1 = Image.new("RGB", iim1.size, (255, 255, 255))
a = open("img-001.html")
# collecting most common word images to mostcommonimg folder
t1 = html5lib.parse(a, namespaceHTMLElements=False)
for s in t1.findall(".//span[@class='ocrx_word']"):
n = n+1
word = filternone(s.text)
#extract coordinates
r = coordinates(s.attrib['title'])
if word in mostcommon_list:
r_replace = r
wimreplace = iim1.crop(r_replace)
wimreplace.save ("output/mostcommonimg/wimreplace{}.png".format(n))
#-------------------------------------------------------------------------------#
# processing output images
iim2 = Image.open("img-002.tiff")
oim2 = Image.new("RGB", iim2.size, (255, 255, 255))
b = open("img-002.html")
print ('4. Reading second scanned image, filtering least common words.')
# collecting most common word images to mostcommonimg folder
t2 = html5lib.parse(b, namespaceHTMLElements=False)
for s in t2.findall(".//span[@class='ocrx_word']"):
    word = filternone(s.text)
    #extract coordinates
    r = coordinates(s.attrib['title'])
    # use PIL to crop out every box, then paste it according to if rule
    wim = iim2.crop(r)
    wimreplace = Image.open(random.choice(glob.glob('./output/mostcommonimg/*.png')))
    wimcolor = Image.new('RGBA', wimreplace.size, (255, 255, 0, 1))
    out = Image.alpha_composite(wimreplace.convert('RGBA'), wimcolor)
    if word.lower() in leastcommon_list:
        oim2.paste(out, (r[0], r[1]))
    else:
        oim2.paste(wim, (r[0], r[1], r[2], r[3]))
#-------------------------------------------------------------------------------#
# save output image
oim2.save("output/scanimg/output-replace1.png")
Final codes
Two scripts, Erase and Replace, which are experiments that question who and what is included or excluded in book scanning. In each script, what is scanned first affects what is visible and what is hidden in whatever is scanned next, and so on. The scripts learn each page's vocabulary and favour the most common words. The least common words recede further and further from view, finally disappearing altogether or even being replaced by the more common words. Every scan session results in a different distortion, and outputs the original scanned images (no matter how many pages), but with the text manipulated.
Erase
import html5lib
from xml.etree import ElementTree as ET
from PIL import Image
from nltk import FreqDist
from nltk.corpus import stopwords
import glob
import os
from fpdf import FPDF
stopwords.words('english')
sr = set(stopwords.words('english'))
def cleanstopwords(list):
    "This cleans stopwords from a list of words"
    clean_words = list[:]
    for word in list:
        if word.lower() in sr:
            clean_words.remove(word)
    return clean_words
def findleastcommon(list):
    "This finds the least common words and returns a list"
    fdist = FreqDist(word.lower() for word in list)
    leastcommon = fdist.most_common()
    for i in leastcommon:
        if (i[1] <= 1):
            leastcommon_list.append(i[0])
    return leastcommon_list
def coordinates(attribute):
    "Extracts the box coordinates of words from an html element tree"
    r = attribute # the 'title' attribute holds the box coordinates of the word
    r, c = r.split(";") # split the attribute into two sections
    r = r.split(" ")[1:] # split, discard the elements which aren't useful
    r = [int(x) for x in r] # put coordinates into list as integers
    return r
def filternone(word_raw):
    "Replaces empty OCR words so the script doesn't crash on them"
    if word_raw is None:
        word = 'y'
    else:
        word = word_raw.strip(',".!:;()')
    return word
x = -1
leastcommon_list = []
allwords = []
scanimg = glob.glob('images-tiff/*.tiff')
hocr = glob.glob('hocr/*.html')
maximum = 20 / len(scanimg)
# this helps the script remove words in a way that
# is proportional to number of pages scanned
# loop through every image in scanimg folder
for i in scanimg:
    x = x + 1
    limit = x * maximum
    iim = Image.open(i) # iim is initial image
    oim = Image.new("RGB", iim.size, (255, 255, 255)) #oim is output image

    # open corresponding hocr file
    f = open(hocr[x])
    print ('Reading scanned image, filtering least common words.')
    print ('')
    t = html5lib.parse(f, namespaceHTMLElements=False)

    # loop through every word in hocr file to analyse words and find least common
    for element in t.findall(".//span[@class='ocrx_word']"):
        word = filternone(element.text)
        allwords.append(word)

    clean_words = cleanstopwords(allwords) #clean stopwords
    findleastcommon(clean_words) #find least common words and add them to list

    print ("The least common words until text", x+1, "are:", leastcommon_list)
    print ('')
    print ('Processing word coordinates and erasing least common words.')
    print ('')

    # loop through every word in hocr file to extract coordinates, then remove or paste into output image
    for element in t.findall(".//span[@class='ocrx_word']"):
        word = filternone(element.text)
        c = coordinates(element.attrib['title'])
        wim = iim.crop(c) # wim is word image
        if word.lower() in leastcommon_list and len(word) < limit:
            oim.paste((255, 255, 255), (c[0], c[1], c[2], c[3]))
        else:
            oim.paste(wim, (c[0], c[1], c[2], c[3]))

    #-------------------------------------------------------------------------------#
    # save and show images
    n = i.replace("images-tiff/","output/erase-replace/").replace(".tiff", "")
    oim.save("{}-{}erase.jpg".format(n, x))
#-------------------------------------------------------------------------------#
# save images into PDF
outputs = glob.glob('output/erase-replace/*erase.jpg')
print ("Saving to PDF: output/erase-replace/Erase.pdf")
def makePdf(pdfFileName, listPages, dir = ''):
    if (dir):
        dir += "/"
    cover = Image.open(dir + str(listPages[0]))
    width, height = cover.size
    pdf = FPDF(unit = "pt", format = [width, height])
    for page in listPages:
        pdf.add_page()
        pdf.image(dir + str(page), 0, 0)
    pdf.output(dir + pdfFileName + ".pdf", "F")
makePdf('output/erase-replace/Erase', outputs, dir = '')
#clean up previous jpg files
files = glob.glob('./output/erase-replace/*erase.jpg')
for f in files:
    os.remove(f)
Replace
import html5lib
from xml.etree import ElementTree as ET
from PIL import Image
from nltk import FreqDist
from nltk.corpus import stopwords
import random
import glob
import time
from fpdf import FPDF
import os
import shutil
path1 = './temp'
if not os.path.isdir(path1):
    os.makedirs(path1)
    os.makedirs('./temp/crops4')
    os.makedirs('./temp/crops7')
    os.makedirs('./temp/crops_more')
stopwords.words('english')
sr = set(stopwords.words('english'))
def cleanstopwords(list):
    "This cleans stopwords from a list of words"
    clean_words = list[:]
    for word in list:
        if word.lower() in sr:
            clean_words.remove(word)
    return clean_words
def findmostcommon(list, int):
    "This finds the most common words and returns a list"
    fdist = FreqDist(word.lower() for word in list)
    mostcommon = fdist.most_common(int)
    mostcommon_list = [i[0] for i in mostcommon]
    return mostcommon_list
def findleastcommon(list):
    "This finds the least common words and returns a list"
    fdist = FreqDist(word.lower() for word in list)
    leastcommon = fdist.most_common()
    for i in leastcommon:
        if (i[1] <= 1):
            leastcommon_list.append(i[0])
    return leastcommon_list
def coordinates(attribute):
    "This extracts the box coordinates of words from an html element tree"
    c = attribute # the 'title' attribute holds the box coordinates of the word
    c, r = c.split(";") # split the attribute into two sections
    c = c.split(" ")[1:] # split again and discard the elements which aren't useful
    c = [int(x) for x in c] # put coordinates into list as integers
    return c
def filternone(word_raw):
    "Replaces empty OCR words so the script doesn't crash on them"
    if word_raw is None:
        word = 'y'
    else:
        word = word_raw.strip(',".!:;()')
    return word
x = -1
leastcommon_list = []
allwords = []
scanimg = glob.glob('images-tiff/*.tiff')
hocr = glob.glob('hocr/*.html')
num = 0
maximum = 20 / len(scanimg)
# this helps the script remove words in a way
# that is proportional to number of pages scanned
# loop through every image in scanimg folder
for i in scanimg:
    x = x + 1
    limit = 15 - (x * maximum)
    iim = Image.open(i) # iim is initial image
    oim = Image.new("RGB", iim.size, (255, 255, 255)) #oim is output image

    # open corresponding hocr file
    f = open(hocr[x])
    print ('Reading scanned image and hocr file, filtering least common words.')
    print ('')
    t = html5lib.parse(f, namespaceHTMLElements=False)

    # loop through every word in hocr file to analyse words and find least common
    for element in t.findall(".//span[@class='ocrx_word']"):
        word = filternone(element.text)
        allwords.append(word)

    clean_words = cleanstopwords(allwords) #clean stopwords
    findleastcommon(clean_words) #find least common words and add them to list
    mostcommon_list = findmostcommon(clean_words, 30) #find most common words and add them to list

    print ('The most common words until text', x+1, 'are:', mostcommon_list)
    print ('The least common words until text', x+1, 'are:', leastcommon_list)
    print ('')

    # loop through every word in hocr file to extract coordinates, then remove or paste into output image
    print ('Processing word coordinates and replacing least common words with most common words.')
    print ('')

    for element in t.findall(".//span[@class='ocrx_word']"):
        word = filternone(element.text)
        c = coordinates(element.attrib['title'])
        num = num + 1
        wim = iim.crop(c) # wim is word image
        #extract coordinates
        if word.lower() in mostcommon_list and len(word) > 1 and len(word) <= 5:
            wim.save("temp/crops4/wimreplace{}.png".format(num))
        elif word in mostcommon_list and len(word) <= 7:
            wim.save("temp/crops7/wimreplace{}.png".format(num))
        elif word in mostcommon_list and len(word) > 7:
            wim.save("temp/crops_more/wimreplace{}.png".format(num))

        if x > 0:
            # use PIL to crop out every box, then paste it according to if rule
            randomimg4 = random.choice(glob.glob('temp/crops4/*.png'))
            randomimg7 = random.choice(glob.glob('temp/crops7/*.png'))
            randomimg_more = random.choice(glob.glob('temp/crops_more/*.png'))
            wimreplace4 = Image.open(randomimg4)
            wimreplace7 = Image.open(randomimg7)
            wimreplace_more = Image.open(randomimg_more)
            wimcolor4 = Image.new('RGBA', wimreplace4.size, (250, 230, 0, 90))
            wimcolor7 = Image.new('RGBA', wimreplace7.size, (250, 230, 0, 90))
            wimcolor_more = Image.new('RGBA', wimreplace_more.size, (250, 230, 0, 90))
            out4 = Image.alpha_composite(wimreplace4.convert('RGBA'), wimcolor4)
            out7 = Image.alpha_composite(wimreplace7.convert('RGBA'), wimcolor7)
            out_more = Image.alpha_composite(wimreplace_more.convert('RGBA'), wimcolor_more)

            if word.lower() in leastcommon_list and len(word) <= 3:
                oim.paste(wim, (c[0], c[1], c[2], c[3]))
            elif word.lower() in leastcommon_list and len(word) < 8:
                oim.paste(out4, (c[0], c[1]))
            elif word.lower() in leastcommon_list and len(word) < 11:
                oim.paste(out7, (c[0], c[1]))
            elif word.lower() in leastcommon_list and len(word) > 8:
                oim.paste(out_more, (c[0], c[1]))
            else:
                oim.paste(wim, (c[0], c[1], c[2], c[3]))
        else:
            oim.paste(wim, (c[0], c[1], c[2], c[3]))

    #-------------------------------------------------------------------------------#
    # save images
    n = i.replace("images-tiff/","output/erase-replace/").replace(".tiff", "")
    oim.save("{}-{}replace.jpg".format(n, x))
#-------------------------------------------------------------------------------#
# save images into PDF
outputs = glob.glob('output/erase-replace/*replace.jpg')
print ('')
print ("Saving to PDF: output/erase-replace/Replace.pdf")
def makePdf(pdfFileName, listPages, dir = ''):
    if (dir):
        dir += "/"
    cover = Image.open(dir + str(listPages[0]))
    width, height = cover.size
    pdf = FPDF(unit = "pt", format = [width, height])
    for page in listPages:
        pdf.add_page()
        pdf.image(dir + str(page), 0, 0)
    pdf.output(dir + pdfFileName + ".pdf", "F")
makePdf('output/erase-replace/Replace', outputs, dir = '')
#clean up previous jpg and png files
files = glob.glob('./output/erase-replace/*replace.jpg')
for f in files:
    os.remove(f)
shutil.rmtree('./temp/')
Makefile & post-processing scripts
tiffs: ## convert images/ to images-tiff/ Depends on IM
	echo $(images)
	for i in $(images); \
	do tiff=`basename $$i .jpg`.tiff; \
	convert -density 300 $$i -colorspace RGB -type truecolor -alpha on images-tiff/$$tiff; \
	echo $$tiff; \
	done;

hocrs: ## hocr with tesseract and then change extension to .html
	for i in images-tiff/*.tiff; \
	do echo $$i; hocrfile=`basename $$i .tiff`; \
	tesseract $$i hocr/$$hocrfile hocr; \
	mv hocr/$$hocrfile.hocr hocr/$$hocrfile.html; \
	done;

erase: tiffs hocrs ## Natasha: Analyzes pages in order, erases least common words from view. Dependencies: PIL, html5lib, FPDF
	python3 src/erase_leastcommon.py
	rm $(input-hocr)
	rm $(images-tiff)

replace: tiffs hocrs ## Natasha: Analyzes pages in order, replaces least common words with most common words. Dependencies: PIL, html5lib, FPDF
	python3 src/replace_leastcommon.py
	rm $(input-hocr)
	rm $(images-tiff)