User:Manetta/graduation-proposals/proposal-0.4
graduation proposal +0.4
title: "i could have written that"
alternatives:
- turning words into numbers
Introduction
For in those realms machines are made to behave in wondrous ways, often sufficient to dazzle even the most experienced observer. But once a particular program is unmasked, once its inner workings are explained in language sufficiently plain to induce understanding, its magic crumbles away; it stands revealed as a mere collection of procedures, each quite comprehensible. The observer says to himself "I could have written that". With that thought he moves the program in question from the shelf marked "intelligent" to that reserved for curios, fit to be discussed only with people less enlightened than he. (Joseph Weizenbaum, 1966)
what do you want to do?
"i-could-have-written-that" will be a publishing platform that aims to reveal the inner workings of technologies that systemize natural language, through tools that function as natural language interfaces (for both humans and machines), while regarding such technologies as reading-writing systems.
This publishing platform will report on reading-writing systems that touch the issues of systemization / automation / an algorithmic 'truth' that contain elements of simplification / probability / modeling processes ...
... by looking closely at the material (technical) elements that are used to construct these systems, in order to look for alternative perspectives.
abstract
'i-could-have-written-that' will be a publishing experiment. It aims to reveal the inner workings of automatic reading-writing machines / natural language processing technologies, and to reflect on what effect these techniques have on the information they transmit. In that sense, 'i-could-have-written-that' will be about a different approach to (graphic) design, one which is not designing information, but designing information processing situations.
'#0'-issue:
intro
- WordNet as a dataset that 'maps' language
- Not 'mapping' as a tool to understand (as a primary aim) (as Julie speaks about mapping the physicality of the Internet) but rather 'mapping' in the sense of 'modeling', in order to automate 'natural language processes'.
→ 'automation' is key here ? (natural language processing techniques or automatic reading systems)
→ western urge to simplify / structure / archive knowledge, as sharing knowledge is regarded as something that will bring development in society for the future
(...)
(...)
(...)
elements
the following elements could be part of this issue: (now collected on this 'i-could-have-written-that' webpage)
→ WordNet as structure i-could-have-written-that/ WordNet skeleton
→ a historical list of information processing systems i-could-have-written-that/ historical list of information systems
→ text on automatic reading machines, placing automation in an optical process (1967) in contrast with an algorithmic process (2015) i-could-have-written-that/ Automatic Reading Machines
(...)
(...)
(...)
Relation to a larger context
natural language?
Natural language could be considered as the language that evolves naturally in the human mind through repetition, a process that starts for many people at a young age. For this project i would like to look at 'natural language' from a perspective grounded in computer science, computational linguistics and artificial intelligence (AI), where natural language is mostly used in the context of 'natural language processing' (NLP), a field of study that researches the interactions between human language and the computer.
systemizing natural language?
It is debatable whether language itself could be regarded as a technology or not. For my project i will follow James Gleick's statement in his book 'The Information: a Theory, a History, a Flood'[1], where he states: "Language is not a technology, (...) it is not best seen as something separate from the mind; it is what the mind does. (...) but when the word is instantiated in paper or stone, it takes on a separate existence as artifice. It is a product of tools and it is a tool." From this moment on 'language' is turned into 'written language'.
A very basic writing technology is the Latin alphabet. The main set of 26 characters is a toolbox that enables us to systemize language into characters, into words, into sentences. Considering these tools as technologies makes it possible to follow a line from natural language to a language that computers can take as input, via various forms of mediation.
technologies that systemize natural language?
By working closely with software that is used in the fields of machine learning & text-mining, i hope to reveal the inner workings of such mediating techniques through a practical approach. Elements to work with include for example dictionaries, lexicons, lexical databases (WordNet), other datasets (ConceptNet), ngrams, and other elements that are implemented in such software.
reading technology for both the computer & the human eye
An enthusiastic attempt to create a reading technology for both the human and the computer's 'eye' is published in 'The Journal of Typographic Research' (V1N2-1967[2]). The article 'OCR-B : A Standardized Character for Optical Recognition' presents 'OCR-B[3]', a typeface that is optimized for automatic machinic reading, designed by Adrian Frutiger. The author ends the article by (techno-optimistically) stating the hope that one day "reading machines" will have reached perfection and will be able to distinguish without any error the symbols of our alphabets, in whatever style they may be written.
reading / writing systems?
In this same article, the author foretold a future wherein [a]utomatic optical reading is likely to widen the bounds of the field of data processing[2]. There, the term 'data-processing' refers to typed or printed information on paper, but nowadays 'data-processing' is understood differently. Today, data-processing rather refers to techniques that 'read' natural language not through an optical process, but by perceiving language as 'data'. In the field of data-mining, algorithms are trained to recognize patterns in written language. In order to be able to perceive the text mathematically, text is simplified and turned into numbers. Computers therefore 'read' by counting words (a model called bag-of-words) and the most common word combinations (n-grams).
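As a minimal sketch of what 'turning words into numbers' can look like, the following Python fragment counts single words and pairs of neighbouring words in one sentence. The sentence, the function names and the very crude tokenizer are my own assumptions for illustration; text-mining packages use more elaborate versions of the same steps.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes.
    This throws away punctuation and capitals: a first simplification."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    """Count how often each word occurs, ignoring the order of the words."""
    return Counter(tokenize(text))

def bigrams(text):
    """Count how often each pair of neighbouring words occurs."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical example sentence, not taken from any dataset.
sentence = "the machine reads the text and the machine counts the words"

bow = bag_of_words(sentence)
pairs = bigrams(sentence)

print(bow.most_common(2))    # the two most frequent words with their counts
print(pairs.most_common(1))  # the most frequent word pair with its count
```

After these steps the sentence is no longer a sentence but a table of counts: the word order, and with it most of the grammar, is gone.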
Optical reading machines try to 'read' what has been written from paper directly. They try to understand what has been written, to translate it correctly into digital text. But algorithms aren't tools that perform from a 'read-only' position. Algorithmic reading does not try to 'understand' written text, but rather tries to label it as (for example) being positive or negative. An algorithm looks for patterns in the text, and is then able to compare the current pattern to a set of pre-labeled text. Is the algorithm therefore a 'reading' technology, or could pattern-recognition be seen as an act of 'writing' as well? After all, the algorithm is decoding the written language: first by turning text into patterns, and then by labeling it as being positive or not.
Data-mining techniques are decoding processes but they hide these decoding processes, because they follow an ideology that regards 'data' or 'raw-data' as natural objects, untouched by human hands.[4] It is a common practice to present algorithmic results as objective truths (as it is the data that speaks![5] or because no humans were even involved![6]).
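The comparison with pre-labeled text can be sketched in a few lines of Python. This is a toy under stated assumptions, not the method of any particular software: the four labeled sentences are invented by me, and the comparison uses cosine similarity between word counts, which is only one of many possible measures.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn a text into word counts (its 'pattern')."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Similarity between two word-count patterns, from 0.0 to 1.0."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical, hand-made set of pre-labeled texts.
LABELED = [
    ("i love this wonderful book", "positive"),
    ("what a great and happy day", "positive"),
    ("i hate this terrible book", "negative"),
    ("what a sad and awful day", "negative"),
]

def label(text):
    """Give the text the label of the most similar pre-labeled example."""
    pattern = vectorize(text)
    _, best_label = max(
        LABELED, key=lambda example: cosine(pattern, vectorize(example[0]))
    )
    return best_label

print(label("a wonderful happy book"))
print(label("this awful terrible day"))
```

Note that label() always returns one of the two labels, even for a text that shares no words at all with the examples. In this sketch the machine never answers 'i cannot read this'; it writes a label in every case.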
This ideology seems to come very close to what has been predicted in the article from 1967: "reading machines" [that] will have reached perfection and will be able to distinguish without any error the symbols of our alphabets, in whatever style they may be written. Though in 1967 imagined as an optical reading device, isn't the perfect automatic machinic reading situation of today found in the field of data-mining? An automatic reading machine that widens the bounds of the field of data-processing? A technique being so natural, that even the data can speak?
But data-mining mediates as much as television or a telescope does. Data-mining therefore shouldn't be regarded as a 'read-only' technique, but be treated as a tool that 'reads' and 'writes' at the same time. Instead of hiding the data-processes (workflows, files and choices that have been made) in data-mining practices, i would like to reveal and share information about them.
To do this, i would like to work closely with data-mining tools: data/text-mining (text-parsing, text-simplification, vector-space-models, looking at algorithmic culture), machine learning (training-sets, taxonomies, categories, annotation), logic (simplification, universal representative systems, programming languages)
publishing platform?
revealing / informing / publishing
Although algorithms become more and more present in daily life, in the form of e.g. automatic recommendations (music playlists / Amazon products) or predictions in probability rates (suspicious behavior / climate change patterns), their constructions become more and more complex and hence harder to depict or understand (for you, me, and sometimes even for academics themselves). Therefore i think it is important to publish about these systems, both to reveal the fascinating systems that have been developed, the attempts, the dreams, but also to present a critical take on the way that these systems construct their 'truths'.
By departing from a very technical point of view, i hope to develop a stage for alternative perspectives on these issues (making 'it-just-not-works' tutorials for example), while keeping a wide audience in mind. I don't want to exclude a broader group of people from understanding reading-writing techniques, as that is precisely the critique i have on the field of data-mining.
These aims are related to cultural principles present in the field of open-source: take for example the aim for distribution instead of centralized sources (for example: Wikipedia), the aim of making information available for everyone (in the sense that it should not only be available but also legible), and the aim for collaborative work (as opposed to ownership). These principles will influence my design choices, for example: to consider an infrastructure that enables collaborative work.
from designing information, to designing information processes
Coming from a background in graphic design, i was educated in a traditional way (focus on typography and aesthetics) in combination with courses in 'design strategy' and 'meaning-making' (which was not defined in such clear terms btw.). I became interested in semiotics, and in systems that use symbols/icons/indexes to gain meaning.
After my first year at the Piet Zwart, i feel that my interest shifts from designing information on an interface level, to designing information processes. Being fascinated by looking at the inner workings of techniques, and being affected by the open source principles, brings up a whole set of new design questions. For example: How can an interface reveal its inner system? How can structural decisions be design actions? And how could a workflow change the information it is processing?
I would like to include this shift in my graduation work, to let my project also be a publishing experiment, by focussing on the infrastructure and workflow of the publication(s).
Relation to previous practice
In the last year, i've been looking at different tools that process natural language. From speech-to-text software to text-mining tools, they all systemize language in various ways.
As a continuation of that i took part in the Relearn summerschool in Brussels last August (2015), where i proposed a work track in collaboration with Femke Snelting on the subject of 'training common sense'. With a group of people we have been trying to deconstruct the 'truth-construction' in algorithmic cultures, by looking at data mining processes, deconstructing the mathematical models that are used, finding moments where semantics are mixed with mathematics, and trying to grasp what kind of cultural context is created around this field. We worked with a text-mining software package called 'Pattern'. The workshop during Relearn transformed into a project that we called '#!Pattern+', which will be strongly collaborative and ongoing over a longer time span. #!Pattern+ will be a critical fork of the latest version of Pattern, including reflections and notes on the software and the culture that surrounds it. The README file that has been written for #!PATTERN+ is online here, and more information is collected on this wiki page.
Another entrance to understanding what happens in algorithmic practices such as machine learning is to look at the training sets that are used to train algorithms to recognize certain patterns in a set of data. These training sets could contain a large set of images, texts, 3d models, or videos. By looking at such datasets, and more specifically at the choices that have been made in terms of structure and hierarchy, steps of the construction of a certain 'truth' are revealed. For the exhibition "Encyclopedia of Media Object" in V2 last June, i created a catalog, voice over and booklet, which placed the objects from the exhibition within the framework of the SUN database, a resource of images for image recognition purposes. (link to the "i-will-tell-you-everything (my truth is a constructed truth)" interface)
There are a few datasets in the academic world that seem to be basic resources to build these training sets upon. In the field they are called 'knowledge bases'. They live on a more abstract level than the training sets do, as they try to create a 'knowledge system' that could function as a universal structure. Examples are WordNet (a lexical dataset), ConceptNet, and OpenCyc (an ontology dataset). In the last months i've been looking into WordNet, worked on a WordNet Tour (still ongoing), and made an alternative browser interface (with cgi) for WordNet. It's all a process that is not yet transformed into an object/product, but until now documented here and here on the Piet Zwart wiki.
Thesis intention
I would like to integrate my thesis in my graduation project, to let it be the content of the publication(s). This could take multiple forms, for example:
- interview with creators of datasets or lexicons like WordNet
- close reading of a piece of software, like we did during the workshop at Relearn. Options could be: text-mining software Pattern (Relearn), or Weka 3.0; or WordNet, ConceptNet, OpenCyc
Practical steps
how?
- creating a historical context, a list of information processing systems, started here: i-could-have-written-that/ historical list of information systems
- creating a context of automatic 'reading' machines, started here: i-could-have-written-that/ Automatic Reading Machines
- starting a series of reading/writing exercises, in continuation of the way of working in the prototype classes and during Relearn.
- mapping WordNet's structure
- using WordNet as a writing filter?
- WordNet as structure for a collection (similar to the way i've used the SUN database)
while using open-source software, in order to be able to have a conversation with the tools that will be discussed, and to open them up.
questions of research
- How can an interface reveal its inner system? How can structural decisions be design actions? And how could a workflow change the information it is processing?
- how to communicate an alternative view on algorithmic reading-writing machines?
- how to build and maintain a (collaborative) publishing project?
- technically: what kind of system to use to collect? wiki? mailinglist interface?
- what kind of system to use to publish?
- publishing: online + print --> inter-relation
- in what context ?
references
- [1] James Gleick's personal webpage, The Information: a Theory, a History, a Flood - James Gleick (2011)
- [2] The Journal of Typographic Research, V1N2-1967 (PDF), published between 1967 and 1971, then transformed into 'Visible Language'
- [3] OCR-B on Linotype, designed by Adrian Frutiger in 1967
- [4] Presentation by Antoinette Rouvroy – All Watched Over by Algorithms
- [5] TED Talk PENN, Lyle Ungar presenting text mining results; "I'm sorry" "but these are the words", more info: on i-could-have-written-that page
- [6] Yahoo help section for the Friendly Flickr Bot, "The process is fully automated, so no humans are ever involved in tagging your images." more info: http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/flickr_s-friendly-robots/flickr_s-friendly-robots.html
- The Journal of Typographic Research (1967-1971) (now: Visible Language)
- Radical Software (1970-1974, NY)
- die Datenschleuder, Chaos Computer Club publication (1984-ongoing, DE)
- Dot Dot Dot (2000-2011, USA)
- the Serving Library (2011-ongoing, USA)
- OASE, on architecture (NL)
- Libre Graphics Magazine (2010-ongoing, PR)
- Works that Work (2013-ongoing, NL)
- Neural (IT)
- Aprja (DK)
other publishing platforms :
- Monoskop
- unfold.thevolumeproject.org
- mailinglist interface: lurk.org
- mailinglist interface: nettime --> discussions in public
- archive of publications closely related to technology: P-DPA (Silvio Lorusso)
publications :
- art post-internet (2014), a PDF + webpage catalogue
- Hybrid Lecture Player (to be viewed in Chrome/Chromium)
datasets
* WordNet (Princeton)
* ConceptNet 5 (MIT Media)
* OpenCyc
people
algorithmic culture
- Luciana Parisi
- Matteo Pasquinelli
- Antoinette Rouvroy
- Seda Gurses
other
Software Studies. A lexicon. by Matthew Fuller (2008)
reading list
BAK lecture: Matthew Fuller, on the discourse of the powerpoint (Jun. 2015) - annotations
project: i will tell you everything (my truth is a constructed truth)