User:Manetta/graduation-proposals/proposal-0.4

graduation proposal +0.4

title: "i could have written that"

alternatives:

  • turning words into numbers

Introduction

For in those realms machines are made to behave in wondrous ways, often sufficient to dazzle even the most experienced observer. But once a particular program is unmasked, once its inner workings are explained in language sufficiently plain to induce understanding, its magic crumbles away; it stands revealed as a mere collection of procedures, each quite comprehensible. The observer says to himself "I could have written that". With that thought he moves the program in question from the shelf marked "intelligent" to that reserved for curios, fit to be discussed only with people less enlightened than he. (Joseph Weizenbaum, 1966)

what do you want to do?

"i-could-have-written-that" will be a publishing platform that aims to reveal the inner workings of technologies that systemize natural language. It does so through tools that function as natural language interfaces (for both humans and machines), while regarding such technologies as reading-writing systems.

This publishing platform will report on reading-writing systems that touch the issues of systemization / automation / an algorithmic 'truth' that contain elements of simplification / probability / modeling processes ...

... by looking closely at the material (technical) elements that are used to construct these systems, in order to look for alternative perspectives.

abstract

'i-could-have-written-that' will be a publishing experiment. It aims to reveal the inner workings of automatic reading-writing machines / natural language processing technologies, and to reflect on what effect these techniques have on the information they transmit. In that sense, 'i-could-have-written-that' will be about a different approach to (graphic) design, one which is not designing information, but designing information processing situations.


'#0'-issue:

intro

  • WordNet as a dataset that 'maps' language
  • Not 'mapping' as a tool to understand (as a primary aim) (as Julie speaks about mapping the physicality of the Internet) but rather 'mapping' in the sense of 'modeling', in order to automate 'natural language processes'.

→ 'automation' is key here ? (natural language processing techniques or automatic reading systems)

→ western urge to simplify / structure / archive knowledge, as sharing knowledge is regarded as something that will bring development in society for the future

(...)

(...)

(...)

elements

the following elements could be part of this issue: (now collected on this 'i-could-have-written-that' webpage)

→ WordNet as structure
i-could-have-written-that/ WordNet skeleton
→ a historical list of information processing systems
i-could-have-written-that/ historical list of information systems
→ text on automatic reading machines, 
placing automation in an optical process (1967) in contrast with an algorithmic process (2015)
i-could-have-written-that/ Automatic Reading Machines

(...)

(...)

(...)


Relation to a larger context

natural language?

Natural language could be considered as the language that evolves naturally in the human mind through repetition, a process that starts for many people at a young age. For this project i would like to look at 'natural language' from a perspective grounded in computer science, computational linguistics and artificial intelligence (AI), where natural language is mostly used in the context of 'natural language processing' (NLP), a field of studies that researches the interactions between human language and the computer.

systemizing natural language?

It is debatable whether language itself could be regarded as a technology or not. For my project i will follow James Gleick's statement in his book 'The Information: a Theory, a History, a Flood'[1], where he states: Language is not a technology, (...) it is not best seen as something separate from the mind; it is what the mind does. (...) but when the word is instantiated in paper or stone, it takes on a separate existence as artifice. It is a product of tools and it is a tool. From this moment on 'language' is turned into 'written language'.

A very basic writing technology is the Latin alphabet. The main set of 26 characters is a toolbox that enables us to systemize language into characters, into words, into sentences. Considering these tools as technologies makes it possible to follow a line from natural language to a language that computers can take as input, via various forms of mediation.

technologies that systemize natural language?

By working closely with software that is used in the fields of machine learning & text-mining, i hope to reveal the inner workings of such mediating techniques through a practical approach. Elements to work with include for example dictionaries, lexicons, lexical databases (WordNet), other datasets (ConceptNet), ngrams, and other elements that are implemented in such software.

reading technology for both the computer & the human eye

OCR-B, designed by Adrian Frutiger (1967), screenshot from the article OCR-B: A Standardized Character for Optical Recognition in The Journal of Typographic Research, V1N2-1967 (PDF)

An enthusiastic attempt to create a reading technology for both the human and the computer's 'eye' is published in 'The Journal of Typographic Research' (V1N2-1967[2]). The article OCR-B : A Standardized Character for Optical Recognition presents 'OCR-B[3]', a typeface that is optimized for automatic machinic reading, designed by Adrian Frutiger. The author ends the article by (techno-optimistically) stating the hope that one day "reading machines" will have reached perfection and will be able to distinguish without any error the symbols of our alphabets, in whatever style they may be written.

reading / writing systems?

schema example of Pattern, a web mining module for the Python programming language.

In this same article, the author foretold a future wherein [a]utomatic optical reading is likely to widen the bounds of the field of data processing[2]. There, the term 'data-processing' refers to typed or printed information on paper, but nowadays 'data-processing' is understood differently. Today, data-processing rather refers to techniques that 'read' natural language not through an optical process, but by perceiving language as 'data'. In the field of data-mining, algorithms are trained to recognize patterns in written language. In order to be able to perceive the text mathematically, text is simplified and turned into numbers. Computers therefore 'read' by counting words (the so-called bag-of-words model) and the most common word combinations (n-grams).
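To make this 'reading by counting' concrete, here is a minimal sketch in Python. It assumes a deliberately crude tokenizer (lowercasing and splitting on letters) and an example sentence invented for this illustration; it is not how Pattern or any particular text-mining package implements these steps.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs; the order of the words is thrown away."""
    return Counter(tokenize(text))


def bigrams(text):
    """Count pairs of neighbouring words (n-grams with n = 2)."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical example sentence, only for illustration.
sentence = "the data speaks and the data counts"

print(bag_of_words(sentence))
print(bigrams(sentence).most_common(1))
```

After this step the sentence no longer exists as a sentence: what remains is a table of numbers, and every later 'reading' operates on that table only.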

Optical reading machines try to 'read' what has been written from paper directly. They try to understand what has been written, to translate it correctly into digital text. But algorithms aren't tools that perform from a 'read-only' position. Algorithmic reading does not try to 'understand' written text, but rather tries to label it as (for example) being positive or negative. An algorithm looks for patterns in the text, and is then able to compare the current pattern to a set of pre-labeled text. Is the algorithm therefore a 'reading' technology, or could pattern-recognition be seen as an act of 'writing' as well? After all, the algorithm is decoding the written language: first by turning text into patterns, and then by labeling it as being positive or not.
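The two steps described here (turning text into a pattern, then labeling it by comparison with pre-labeled text) can be sketched as follows. This is a toy under stated assumptions: the four training sentences and their labels are hypothetical and hand-written for this example, and the comparison uses cosine similarity with a nearest-neighbour rule, which is only one of many techniques that real text-mining software might use.

```python
from collections import Counter
import math
import re


def vectorize(text):
    """Step 1: turn text into a pattern (here: word counts)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Similarity between two word-count patterns, between 0.0 and 1.0."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical, hand-labeled training set (an assumption for this sketch).
TRAINING = [
    ("what a wonderful and fascinating machine", "positive"),
    ("i love this beautiful typeface", "positive"),
    ("a terrible and boring result", "negative"),
    ("i hate this broken interface", "negative"),
]


def label(text):
    """Step 2: copy the label of the most similar pre-labeled text."""
    pattern = vectorize(text)
    _, best_label = max(TRAINING, key=lambda pair: cosine(pattern, vectorize(pair[0])))
    return best_label


print(label("this machine is wonderful"))
print(label("this interface is boring and broken"))
```

Note that `label` always returns a label, even for a text that shares no word at all with the training set. In that sense the sketch never only 'reads': it writes one of the categories that were chosen in advance onto whatever it is given.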

Antoinette Rouvroy speaking about big-data and its ideology of being a natural resource – youtube-video, Transmediale 2015, All Watched Over by Algorithms

Data-mining techniques are decoding processes but are hiding these decoding processes, because they follow the ideology to regard 'data' or 'raw-data' as natural objects, untouched by human hands.[4] It is a common practice to present algorithmic results as objective truths (as it is the data that speaks![5] or because no humans were even involved![6]).

This ideology seems to come very close to what has been predicted in the article from 1967: "reading machines" [that] will have reached perfection and will be able to distinguish without any error the symbols of our alphabets, in whatever style they may be written. Though in 1967 imagined as an optical reading device, isn't the perfect automatic machinic reading situation of today found in the field of data-mining? An automatic reading machine that widens the bounds of the field of data-processing? A technique being so natural, that even the data can speak?

But data-mining mediates as much as television or a telescope does. Data-mining therefore shouldn't be regarded as a 'read-only' technique, but be treated as a tool that 'reads' and 'writes' at the same time. Instead of hiding the data-processes (workflows, files and choices that have been made) in data-mining practices, i would like to reveal and share information about them.

To do this, i would like to work closely with data-mining tools: data/text-mining (text-parsing, text-simplification, vector-space-models, looking at algorithmic culture), machine learning (training-sets, taxonomies, categories, annotation), logic (simplification, universal representative systems, programming languages)

publishing platform?

revealing / informing / publishing

Although algorithms become more and more present in daily life, in the form of e.g. automatic recommendations (music playlists / Amazon products) or predictions in probability rates (suspicious behavior / climate change patterns), their constructions become more and more complex and hence harder to depict or understand (for you, me, and sometimes even for academics themselves). Therefore i think it is important to publish about these systems, both to reveal the fascinating systems that have been developed, the attempts, the dreams, but also to present a critical take on the way that these systems construct their 'truths'.

By departing from a very technical point of view, i hope to develop a stage for alternative perspectives on these issues (making 'it-just-not-works' tutorials for example), while keeping a wide audience in mind. I don't want to exclude a broader group of people from understanding reading-writing techniques, as that is precisely the critique i have on the field of data-mining.

These aims are related to cultural principles present in the field of open-source: take for example the aim for distribution instead of centralized sources (for example: Wikipedia), the aim of making information available for everyone (in the sense that it should not only be available but also legible), and the aim for collaborative work (as opposed to ownership). These principles will influence my design choices, for example: to consider an infrastructure that enables collaborative work.

from designing information, to designing information processes

Coming from a background in graphic design, i was educated in a traditional way (focus on typography and aesthetics) in combination with courses in 'design strategy' and 'meaning-making' (which was not defined in such clear terms btw.). I became interested in semiotics, and in systems that use symbols/icons/indexes to gain meaning.

After my first year at the Piet Zwart, i feel that my interest shifts from designing information on an interface level, to designing information processes. Being fascinated by looking at the inner workings of technique and being affected by the open source principles brings up a whole set of new design questions. For example: How can an interface reveal its inner system? How can structural decisions be design actions? And how could a workflow change the information it is processing?

I would like to include this shift in my graduation work, to let my project also be a publishing experiment, by focussing on the infrastructure and workflow of the publication(s).


Relation to previous practice

In the last year, i've been looking at different tools that process natural language. From speech-to-text software to text-mining tools, they all systemize language in various ways.

training common sense, work track at Relearn 2015

As a continuation of that, i took part in the Relearn summer school in Brussels last August (2015), proposing a work track in collaboration with Femke Snelting on the subject of 'training common sense'. With a group of people we have been trying to deconstruct the 'truth-construction' in algorithmic cultures, by looking at data mining processes, deconstructing the mathematical models that are used, finding moments where semantics are mixed with mathematics, and trying to grasp what kind of cultural context is created around this field. We worked with a text-mining software package called 'Pattern'. The workshop during Relearn transformed into a project that we called '#!Pattern+', which will be strongly collaborative and ongoing over a longer time span. #!Pattern+ will be a critical fork of the latest version of Pattern, including reflections and notes on the software and the culture that surrounds it. The README file that has been written for #!PATTERN+ is online here, and more information is collected on this wiki page.

"i will tell you everything (my truth is a constructed truth)", catalog of "Encyclopedia of Media Object" in V2, June 2015

Another entrance to understanding what happens in algorithmic practices such as machine learning is by looking at training sets that are used to train algorithms to recognize certain patterns in a set of data. These training sets could contain a large set of images, texts, 3d models, or videos. By looking at such datasets, and more specifically at the choices that have been made in terms of structure and hierarchy, steps in the construction of a certain 'truth' are revealed. For the exhibition "Encyclopedia of Media Object" in V2 last June, i created a catalog, voice over and booklet, which placed the objects from the exhibition within the framework of the SUN database, a resource of images for image recognition purposes. (link to the "i-will-tell-you-everything (my truth is a constructed truth)" interface)

There are a few datasets in the academic world that seem to be basic resources to build these training sets upon. In the field they are called 'knowledge bases'. They live on a more abstract level than the training sets do, as they try to create a 'knowledge system' that could function as a universal structure. Examples are WordNet (a lexical dataset), ConceptNet, and OpenCyc (an ontology dataset). In the last months i've been looking into WordNet, worked on a WordNet Tour (still ongoing), and made an alternative browser interface (with cgi) for WordNet. It's all a process that has not yet been transformed into an object/product, but is until now documented here and here on the Piet Zwart wiki.
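To give an idea of what such a 'universal structure' looks like from the inside: WordNet organizes nouns in a hierarchy of 'is a kind of' (hypernym) relations that ends in the single root 'entity'. The sketch below imitates that idea with a small, hand-written hierarchy. The words and links in `HYPERNYM` are a simplified assumption made for this example and are not copied from the actual WordNet database; the function names are hypothetical as well.

```python
# Hand-written toy hierarchy, loosely modelled on WordNet's hypernym
# ("is a kind of") relation. This is NOT real WordNet data.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
    "animal": "entity",
}


def hypernym_path(word):
    """Follow the 'is a kind of' links upwards until the root is reached."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def common_ancestor(a, b):
    """Return the most specific category that both words fall under, if any."""
    ancestors_of_a = hypernym_path(a)
    for node in hypernym_path(b):
        if node in ancestors_of_a:
            return node
    return None


print(" -> ".join(hypernym_path("dog")))
print(common_ancestor("dog", "cat"))
```

Every choice in such a table (which word sits under which category, and where the hierarchy stops) is a modeling decision made by a person, which is exactly the kind of decision this project wants to make visible.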

Thesis intention

I would like to integrate my thesis in my graduation project, to let it be the content of the publication(s). This could take multiple forms, for example:

  • interview with creators of datasets or lexicons like WordNet
  • close reading of a piece of software, like we did during the workshop at Relearn. Options could be: text-mining software Pattern (Relearn), or Weka 3.0; or WordNet, ConceptNet, OpenCyc


Practical steps

how?

  • starting a series of reading/writing exercises, in continuation of the way of working in the prototype classes and during Relearn.
    • mapping WordNet's structure
    • using WordNet as a writing filter?
    • WordNet as structure for a collection (similar to the way i've used the SUN database)

while using open-source software, in order to be able to have a conversation with the tools that will be discussed, and to open them up.

questions of research

  • How can an interface reveal its inner system? How can structural decisions be design actions? And how could a workflow change the information it is processing?
  • how to communicate an alternative view on algorithmic reading-writing machines?
  • how to build and maintain a (collaborative) publishing project?
    • technically: what kind of system to use to collect? wiki? mailinglist interface?
    • what kind of system to use to publish?
    • publishing: online + print --> inter-relation
    • in what context ?

references

current or former (related) magazines :

other publishing platforms :

publications :

datasets

* WordNet (Princeton)
* ConceptNet 5 (MIT Media)
* OpenCyc

people

algorithmic culture

Luciana Parisi
Matteo Pasquinelli
Antoinette Rouvroy
Seda Gurses

other

Software Studies. A lexicon. by Matthew Fuller (2008)

reading list

notes and related projects

BAK lecture: Matthew Fuller, on the discourse of the powerpoint (Jun. 2015) - annotations

project: WordNet

project: i will tell you everything (my truth is a constructed truth)

project: serving simulations