User:Manetta/thesis/thesis-outline
outline
intro
NLP
With 'i-could-have-written-that' I would like to look at technologies that process natural language (NLP). There is a range of different expectations of NLP systems, but full coverage of natural language is unlikely. By regarding NLP software as cultural objects, I'll focus on the inner workings of their technologies: what are the technical and social mechanisms that systemize our natural language in order for it to be understood by a computer?
NLP is a category of software packages concerned with the interaction between human language and machine language. NLP is mainly present in the fields of computer science, artificial intelligence and computational linguistics. On a daily basis people deal with services that contain NLP techniques: translation engines, search engines, speech recognition, auto-correction, chatbots, OCR (optical character recognition), license plate detection, data mining. For 'i-could-have-written-that', I would like to place NLP software at the centre, not only as technology but also as a cultural object, to reveal how NLP software is constructed to understand human language, and what side effects these techniques have.
knowledge discovery in data (data-mining)
For the occasion of this year's graduation project, I would like to focus on the practice of text mining, which is a subfield of so-called 'data mining'.
title: i could have written that
Text mining is part of an analytical practice of searching for patterns in text following "A Data Driven Approach", and assigning these patterns to (predefined) profiles. It is part of a bigger information-construction process (called Knowledge Discovery in Data, KDD) which involves source selection, data creation, simplification, translation into vectors, and testing techniques.
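To make these steps a bit more concrete, the sketch below walks through them in a few lines of Python. It is only an illustration: the function names, the tiny corpus and the 'profile' are invented for this example, and an actual KDD workflow would be far more elaborate.

 # a minimal, hypothetical sketch of the KDD steps named above:
 # source selection, simplification, translation into vectors,
 # and testing against a (predefined) profile
 import re
 from collections import Counter
 
 def select_sources(corpus, keyword):
     # source selection: keep only documents that mention the keyword
     return [doc for doc in corpus if keyword in doc.lower()]
 
 def simplify(doc):
     # simplification: lowercase the text and keep only the words
     return re.findall(r"[a-z]+", doc.lower())
 
 def vectorize(words):
     # translation into a vector: one key-value pair (word -> count) per word
     return Counter(words)
 
 def matches(vector, profile):
     # testing: does the document vector satisfy the predefined profile?
     return all(vector[word] >= weight for word, weight in profile.items())
 
 corpus = ["Data is the new oil, they say.", "The weather is mild today."]
 profile = {"data": 1, "oil": 1}
 for doc in select_sources(corpus, "data"):
     print(matches(vectorize(simplify(doc)), profile))   # prints: True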
context
Text mining is a politically sensitive technology, closely related to surveillance and privacy discussions around 'big data'. The technique sits in the middle of tense discussions about capturing people's behaviour for security reasons, while affecting the privacy of a lot of people, accompanied by an unpleasant controlling force that seems to be omnipresent. After the disclosure of the NSA's data capturing program by Edward Snowden in 2013, a wider public became aware of the silent data collecting activities carried out by a governmental agency, for example on phone metadata. Since October 2014, UK law has included special exceptions in its copyright rules to make text mining of intellectual property possible for non-commercial use. Also problematic is the skewed balance between data producers and data analysts, also framed as 'data colonialism', and the governing role this gives to data analytics, for example by constructing your list of search results according to your data profile.
The magical effect of text mining results, caused by the difficulty of understanding how these results are constructed, makes it difficult to formulate an opinion about text mining techniques. It even makes it difficult to formulate what the problem exactly is, as many people tend to agree with the calculations and word counts that are seemingly executed. "What is exactly the problem?" and "This is the data that speaks, right?" are questions that need to be challenged in order to have a conversation about text mining techniques at all.
hypothesis
The results of data-mining software are not mined; they are constructed.
audience
This thesis is aimed at a public that is interested in an alternative perspective on buzzwords like 'big data' and 'data mining'. Also, this thesis will (hopefully!) offer a view from the computer's side: how software is written to understand the non-computer world of written text.
reading technique: a 'key-value pair' in vectors
Data is seen as a material that can easily be extracted from the web, and is regarded as that little 'truth snapshot' taken at a certain time and moment. In text mining, the material used as input consists of written pieces of text transformed into data. The data is formatted as a word + number, a feature + weight, a key + value pair, or, as a linguist would call it, a subject + predicate combination. This is the material building block from which text mining outcomes are constructed. Immaterial building blocks are: points of departure, source selection, noise reduction, and testing techniques. All these elements affect the vector / the model / the algorithm; a small sketch of such a key-value vector follows the list below.
- what is the position of the key-value pair in the text mining process?
- description of the 5 KDD steps
- what side effects does describing a profile in key-value pairs have?
- using written words is already a representation: the word represents the meaning/thought/intention
- in text mining, written text is aimed to be interpreted from a reader's perspective (as opposed to the writer's intention)
- key-value pair is a representational format: the value represents the key
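As a small illustration of the key-value format itself (again my own sketch, not an existing text mining tool), the fragment below turns one sentence into a word + count vector twice: once as it is, and once after an arbitrary 'noise reduction' choice, to show that the resulting vector already depends on such decisions.

 # a short sketch: the same text as key-value pairs (word -> count),
 # with and without an arbitrary 'noise reduction' (stopword) choice
 from collections import Counter
 
 text = "the cat sat on the mat, the cat slept"
 stopwords = {"the", "on"}        # an invented noise-reduction decision
 
 words = text.replace(",", "").split()
 vector_raw = Counter(words)                                     # all words
 vector_reduced = Counter(w for w in words if w not in stopwords)
 
 print(vector_raw)      # Counter({'the': 3, 'cat': 2, 'sat': 1, ...})
 print(vector_reduced)  # Counter({'cat': 2, 'sat': 1, 'mat': 1, 'slept': 1})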
systemization of language & NLP (a dream)
- where could the key-value format be traced back to?
- computer history?
- logic
- rational tradition & philosophy
Text mining seems to be a rather brutal way of processing natural language into useful information. To reflect on this brutality, it could be useful to trace back a longer tradition of natural language processing. Hopefully this will be a way to create some distance from the hurricanes of data that are mainly known as 'big', 'raw' or 'mined' these days.
By looking at projects from the past in the field of NLP (as part of AI/computer science departments), this chapter will attempt to highlight their attitudes towards written text.
circularity
- could text mining software be regarded as a cybernetic feedback system?
- in which the process of text mining is not about 'finding out' whether assumptions can be confirmed, but is engineered to confirm them
- applied to key-value appearances in other traditions?
material
bibliography (six key texts)
- Matthew Fuller - short presentation of the poem: Blue Notebook #10 / The Red-haired Man, during Ideographies of Knowledge, Mundaneum, Mons (Oct. 2015); → annotations
- Joseph Weizenbaum - Computer Power and Human Reason: From Judgement to Calculation (1976);
- Winograd + Flores - Understanding Computers & Cognition (1987);
- Vilem Flusser - Towards a Philosophy of Photography (1983); → annotations
- Antoinette Rouvroy - All Watched Over By Algorithms - Transmediale (Jan. 2015); → annotations
- The Journal of Typographic Research - OCR-B: A Standardized Character for Optical Recognition (V1N2, 1967); → abstract
annotations
- Alan Turing - Computing Machinery and Intelligence (1950);
- The Journal of Typographic Research - OCR-B: A Standardized Character for Optical Recognition (V1N2, 1967); → abstract
- Ted Nelson - Computer Lib & Dream Machines (1974);
- Joseph Weizenbaum - Computer Power and Human Reason (1976); → annotations
- Walter J. Ong - Orality and Literacy (1982);
- Vilem Flusser - Towards a Philosophy of Photography (1983); → annotations
- Christiane Fellbaum - WordNet, an Electronic Lexical Database (1998);
- Charles Petzold - Code, the hidden languages and inner structures of computer hardware and software (2000); → annotations
- John Hopcroft, Rajeev Motwani, Jeffrey Ullman - Introduction to Automata Theory, Languages, and Computation (2001);
- James Gleick - The Information: A History, a Theory, a Flood (2011); → annotations
- Matthew Fuller - Software Studies. A lexicon (2008);
- Language, Florian Cramer; → annotations
- Algorithm, Andrew Goffey;
- Marissa Mayer - The Physics of Data, lecture (2009); → annotations
- Matthew Fuller & Andrew Goffey - Evil Media (2012); → annotations
- Antoinette Rouvroy - All Watched Over By Algorithms - Transmediale (Jan. 2015); → annotations
- Benjamin Bratton - Outing A.I.: Beyond the Turing Test (Feb. 2015); → annotations
- Ramon Amaro - Colossal Data and Black Futures, lecture (Oct. 2015); → annotations
- Benjamin Bratton - On A.I. and Cities: Platform Design, Algorithmic Perception, and Urban Geopolitics (Nov. 2015);
currently working on
* terminology: data 'mining'
* Knowledge Discovery in Data (KDD) in the wild, problem formulations
* KDD, applications
* KDD, workflow
* text-processing: simplification
* list of data mining parties