User:Manetta/thesis/thesis-outline
outline
title: i could have written that
Text mining is part of an analytical practice of searching for patterns in text following "a data-driven approach", and assigning these patterns to (predefined) profiles. It is part of a bigger information-construction process (called Knowledge Discovery in Data, KDD) which involves source-selection, data-creation, simplification, translation into vectors, and testing techniques.
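To make these steps concrete, a minimal sketch in Python (just an illustration: the tiny hand-labelled 'corpus', the two profiles and the word-counting score below are made up and stand in for no existing tool):

 # a minimal, hypothetical sketch of the KDD steps named above
 import re
 from collections import Counter

 # 1. source-selection + data-creation: a hand-picked, hand-labelled set of sentences
 corpus = [
     ("i love this phone, the battery is great", "positive"),
     ("terrible screen, i want my money back",   "negative"),
     ("best purchase i have made this year",     "positive"),
     ("the screen broke after a week, terrible", "negative"),
     ("great phone, great battery",              "positive"),
 ]

 def simplify(text):
     # 2. simplification: lowercase, strip punctuation, split into words
     return re.findall(r"[a-z]+", text.lower())

 def vectorize(words):
     # 3. translation into vectors: a bag-of-words count per document
     return Counter(words)

 # 4. testing: hold the last sentence out and see whether counting the words it
 #    shares with each (predefined) profile is enough to reproduce its label
 train, test = corpus[:-1], corpus[-1]
 profiles = {"positive": Counter(), "negative": Counter()}
 for text, label in train:
     profiles[label].update(vectorize(simplify(text)))

 test_vector = vectorize(simplify(test[0]))
 scores = {label: sum((profile & test_vector).values())
           for label, profile in profiles.items()}
 print(scores, "predicted:", max(scores, key=scores.get), "| actual:", test[1])

Every step already makes choices (which sources, which simplifications, which profiles), which is where the construction this thesis is after takes place.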
intro
problematic situation
short description here, plus a focus on text mining, which is mainly regarded as a reading technique. What happens if text mining were seen as a writing technique?
in-between-language / inter-language
OCR-A + OCR-B example. Two fonts that made concessions in their appearance, forming objects that are optimized for both machine and human reading processes.
NLP
In the whole project 'i-could-have-written-that', i would like to place NLP software at the centre, not only as a technology but also as a cultural object, to reveal in which ways NLP software is constructed to understand human language, and what side-effects these techniques have.
hypothesis
The results of data-mining software are not mined; they are constructed.
audience
This thesis aims at an audience that is interested in an alternative perspective on buzzwords like 'big data' and 'data-mining'. Also, this thesis will (hopefully!) offer a view from the computer's side: how software is written to understand the non-computer world of written text.
1 - text mining; what if text mining is seen as a writing machine?
If text mining is regarded as a writing system, what and where does it write?
1.1 - text mining culture
- What are the levels of construction in text mining software culture?
- What is implied when text mining technology is considered to be a reading machine?
- terminology
- How does the metaphor of 'mining' affect the process?
- 'data mining' → Knowledge Discovery in Data (KDD)
- How much can be based on a cultural side-product (like the text that is commonly used, extracted from e.g. social media)?
1.2 – Pattern's gray spots
- What are the levels of construction in text mining software itself?
- What gray spots appear when text is processed?
- What is meant by 'grayness'? How can it be used as an approach to software critique?
- Text processing: how does written text transform into data?
- Bag-of-words, or 'document'
- 'count' → 'weight' (see the sketch after this list)
- trial-and-error, modeling the line
- Testing process
- how is an algorithm actually only 'right' according to that particular test-set, with its own particularities and exceptions?
- loops to improve your outcomes
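As a sketch of the 'count' → 'weight' step named in the list above (an illustration only, assuming a tf-idf style weighting; this is not Pattern's actual code, and the three sentences are made up):

 import math
 from collections import Counter

 documents = [
     "the cat sat on the mat".split(),
     "the dog sat on the log".split(),
     "the cat chased the dog".split(),
 ]

 def weight(word, doc, docs):
     tf = Counter(doc)[word] / len(doc)      # the raw 'count', normalised per document
     df = sum(1 for d in docs if word in d)  # in how many documents does the word appear?
     idf = math.log(len(docs) / df)          # rarer words weigh more
     return tf * idf                         # 'count' has become 'weight'

 for word in ("the", "cat", "log"):
     print(word, round(weight(word, documents[0], documents), 3))
 # 'the' occurs in every document and is weighted down to 0.0;
 # 'cat' keeps some weight; 'log' does not occur in document 0 at all.

The weighting already decides which words are allowed to matter, and the same goes for the testing step: an accuracy score is only computed against one particular test-set, so another test-set, with its own particularities and exceptions, produces another 'rightness'.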
1.3 – text mining (context)
- what are text mining applications? listed here
Showing that text mining has been applied across very different fields, thereby seeming to be a sort of 'holy grail' that solves a lot of problems. (though i'm not sure if this is needed)
1 - 'key-value pair' focus
- what is the position of the key-value pair in the text mining process?
- description of the 5 KDD steps
- what side effects does it have to describe a profile in key-value pairs?
- using written words is already a representation: the word represents the meaning/thought/intention
- in text mining, written text is aimed to be interpreted from a reader's perspective (as opposed to the writer's intention)
- key-value pair is a representational format: the value represents the key (see the sketch below)
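A minimal sketch of what describing a profile in key-value pairs could look like (the keys and values below are hypothetical, not taken from any particular tool):

 # a hypothetical 'profile', reduced to key-value pairs
 profile = {
     "gender":    "female",      # the value stands in for the key
     "age_group": "25-34",
     "sentiment": "positive",
     "interests": ["running", "coffee"],
 }

 # whoever queries the profile only meets the values, not the sentences
 # (or the writer's intention) they were derived from
 print(profile["sentiment"])     # 'positive'; the written text itself is gone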
2 - context and history
- where could the key-value format be traced back to?
- in a tradition of systematizing language?
- Do these aims not rely too much on a rationalistic tradition? Austin's 'Speech Act theory' & Heidegger's 'Dasein' and his view that we “should not split up the object from the predicate”
- in a tradition of data-processing?
- in a tradition of AI/NLP projects like ELIZA & SHRDLU (?)?
Text mining seems to be a rather brutal way to process natural language into useful information. To reflect on this brutality, it could be useful to trace back a longer tradition of natural language processing. Hopefully this will be a way to create some distance from the hurricanes of data that are mainly known as 'big', 'raw' or 'mined' these days.
By looking at projects from the past in the field of NLP (being part of AI/computer science departments), this chapter will attempt to highlight their attitudes towards written text.
3 - circularity (reasoning)
A text mining process does not aim to 'find out' whether assumptions can be confirmed; it is rather engineered so that they are.
- text mining & an iterating workflow
- more on Pattern's workflow here
- IBM's 'iterating' graph
- heliocentric/geocentric diagrams to understand the universe → complexity confirms the input (here: the perception of stars and their positions). Could this be the case in mining practices? (see the sketch below)
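A deliberately crude sketch of such circularity (the scores, the threshold and the 'assumption' below are invented for illustration): rather than testing whether the assumption holds, the loop adjusts a parameter until the outcome confirms it.

 # some model's output scores for six documents (made up)
 scores = [0.42, 0.55, 0.61, 0.38, 0.73, 0.49]
 assumption = "at least half of the documents are 'positive'"

 threshold = 0.6
 while sum(score >= threshold for score in scores) < len(scores) / 2:
     threshold -= 0.05                       # loosen the cut-off and try again

 positives = [s for s in scores if s >= threshold]
 print("threshold lowered to %.2f; %d/%d documents now 'confirm' the assumption"
       % (threshold, len(positives), len(scores)))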
material
bibliography (five key texts)
- Matthew Fuller - short presentation of the poem: Blue Notebook #10 / The Red-haired Man, during Ideographies of Knowledge, Mundaneum, Mons (Oct. 2015); → annotations
- Joseph Weizenbaum - Computer Power and Human Reason: From Judgement to Calculation (1976);
- Winograd + Flores - Understanding Computers & Cognition (1987);
- Vilem Flusser - Towards a Philosophy of Photography (1983); → annotations
- Antoinette Rouvroy - All Watched Over By Algorithms - Transmediale (Jan. 2015); → annotations
- The Journal of Typographic Research - OCR-B: A Standardized Character for Optical Recognition (V1N2, 1967); → abstract
annotations
- Alan Turing - Computing Machinery and Intelligence (1950);
- The Journal of Typographic Research - OCR-B: A Standardized Character for Optical Recognition (V1N2, 1967); → abstract
- Ted Nelson - Computer Lib & Dream Machines (1974);
- Joseph Weizenbaum - Computer Power and Human Reason (1976); → annotations
- Walter J. Ong - Orality and Literacy (1982);
- Vilem Flusser - Towards a Philosophy of Photography (1983); → annotations
- Christiane Fellbaum - WordNet, an Electronic Lexical Database (1998);
- Charles Petzold - Code: The Hidden Language of Computer Hardware and Software (2000); → annotations
- John Hopcroft, Rajeev Motwani, Jeffrey Ullman - Introduction to Automata Theory, Languages, and Computation (2001);
- James Gleick - The Information: A History, a Theory, a Flood (2011); → annotations
- Matthew Fuller - Software Studies. A lexicon (2008);
- Language, Florian Cramer; → annotations
- Algorithm, Andrew Goffey;
- Marissa Mayer - The Physics of Data, lecture (2009); → annotations
- Matthew Fuller & Andrew Goffey - Evil Media (2012); → annotations
- Antoinette Rouvroy - All Watched Over By Algorithms - Transmediale (Jan. 2015); → annotations
- Benjamin Bratton - Outing A.I., Beyond the Turing test (Feb. 2015) → annotations
- Ramon Amaro - Colossal Data and Black Futures, lecture (Oct. 2015); → annotations
- Benjamin Bratton - On A.I. and Cities : Platform Design, Algorithmic Perception, and Urban Geopolitics (Nov. 2015);
currently working on
* terminology: data 'mining'
* Knowledge Discovery in Data (KDD) in the wild, problem formulations
* KDD, applications
* KDD, workflow
* text-processing: simplification
* list of data mining parties