User:Manetta/thesis/thesis-outline: Difference between revisions

From XPUB & Lens-Based wiki
 
(13 intermediate revisions by the same user not shown)
Line 1: Line 1:
<div style="width:100%;max-width:800px;">
<div style="width:100%;max-width:800px;">
=outline=
=outline - i could have written that=


==intro==
===problematic situation===
Written language is primarily a communication technology. Text mining is an undesired side effect of the information economy (ref...!). Text mining becomes part of business plans, where tracking online behavior is crucial to making profitable deals with advertisers. But next to these mining business plans, text mining becomes a technology that seems able to 'extract' how people feel. A commonly applied algorithm is the sentiment algorithm, used for opinion mining on, for example, Twitter, so that tweeted material can be used in news reports or decision-making processes. The World Well Being Project goes a step further, and aims to use Twitter to reveal “how social media can also be used to gain psychological insights” (http://wwbp.org/papers/sam2013-dla.pdf).


Text mining seems to go beyond its own capabilities here, by convincing people that it is the data that 'speaks'. The actual process is hardly traceable, the output explains intangible phenomena, and the process appears to be automated and therefore precise.
===text analytics < > systemization of language===
This text originates from an interest in the systemization of language that is needed for computer software to be able to 'understand' and process written language.  


The issue that i would like to put central is the fact that text mining technologies are regarded as analytical 'reading' machines that extract information from large sets of written text. (→ consequences of 'objectiveness', claims that 'no humans are involved' in such automated processes because 'the data speaks'). But in its process, text mining shows more similarities with a writing process. What happens if text mining software is used as a writing technique?
===vocabulary===
* buzzwords (machine learning, big data, data mining) (ref to Florian Cramer)
* metaphor (too much ???)
* one of the five KDD steps


===in-between-language / inter-language ===
===problematic situation===
OCR A + B example. Two fonts that made concessions in their presence, and form objects that are optimized for both machine and human reading processes.
...


===NLP===
Text mining seems to be a rather brutal way to deal with the aim to process natural language into useful information. To reflect on this brutality, tracing back a longer tradition of natural language processing could be useful. Hopefully this will be a way to create some distance to the hurricanes of data that are mainly known as 'big', 'raw' or 'mined' these days.
In the whole project 'i-could-have-written-that', i would like to place NLP software central, not only as technology but also as a cultural object, to reveal in which way NLP software is constructed to understand human language, and what side-effects these techniques have.


===knowledge discovery in data (data-mining)===
===audience===
For the occasion of this year's graduation project, i would like to focus on the practice of text mining, which is a subfield of the so-called field of 'data mining'.
This thesis will aim for an audience that is interested in an alternative perspective on buzzwords like 'big data' and 'data-mining'. Also, this thesis will (hopefully!) offer a view from a computer-vision side: how software is written to understand the non-computer world of written text.
 
=title: i could have written that=
Text mining is part of an analytical practice of searching for patterns in text following "A Data Driven Approach", and assigning these to (predefined) profiles. It is part of a bigger information-construction process (called: Knowledge Discovery in Data, KDD) which implies source-selection, data-creation, simplification, translation into vectors, and testing techniques.


==hypothesis==
The results of data-mining software are not mined, results are constructed. <br>


==audience==
==chapter 1: on what basis? three settings to highlight differences in text analytical ideologies==
This thesis will aim for an audience that is interested in an alternative perspective on buzzwords like 'big data' and 'data-mining'. Also, this thesis will (hopefully!) offer a view from a computer-vision side: how software is written to understand the non-computer world of written text.
* setting 1: PhD candidate's thesis defence, Faculty of Economics, Erasmus University Rotterdam
* setting 2: Lyle Ungar's TED Talk, World Well Being Project, Faculty of Psychology, University of Pennsylvania
* setting 3: Guy de Pauw's introduction on text mining software, CLiPS, Faculty of Arts & Philosophy, Computational Linguistics & Psycholinguistics department, University of Antwerp


==1 - text mining; what if i regard it as writing machine?==
==chapter 2: deriving information from written text &rarr; the material form of language==
If text mining is regarded to be a writing system, what and where does it write?
* statistical text analytics is not 'read-only', it's writing
===1.1 - text mining culture===
** to extract? &rarr; to derive
* What are the levels of construction in text mining software culture?
* written language as source material
** By considering text mining technology as a reading machine?
** analogy to typography, dealing with the optical materiality of words/sentences/text
** terminology
** text analytics dealing with the quantifiable and structural materiality of words/sentences/text
*** How does the metaphor of 'mining' affect the process?
*** word-counts
*** 'data mining' &rarr; Knowledge Discovery in Data (KDD)
*** word-order/structure
** How much can be based on a cultural side-product (like the text that is commonly used, as it is extracted from e.g. social media)?
* what do these material analyses represent?
** key-value format (?)


=== 1.2 – Pattern's gray spots===
==chapter 3: information extraction / text categorization. diving into the software!==
* What are the levels of construction in text mining software itself?
* unsupervised
** What gray spots appear when text is processed?
* supervised
*** What is meant by 'grayness'? How can it be used as an approach to software critique?
*** Text processing: how does written text transform into data?
*** Bag-of-words, or 'document'
*** 'count' → 'weight'
*** trial-and-error, modeling the line
** Testing process
*** how is an algorithm actually only 'right' according to that particular test-set, with its own particularities and exceptions?
*** loops to improve your outcomes


===1.3 – text mining (context)===
* what are text mining applications? [[User:Manetta/i-could-have-written-that/kdd-applications|listed here]]
Showing that text mining has been applied across very different fields, and thereby seems to be a sort of 'holy grail' that solves a lot of problems. (though i'm not sure if this is needed)
== 1 - 'key value pair' focus ==
<div style="color:gray;"> Data is seen as a material that can easily be extracted from the web, and is regarded as that little 'truth-snapshot' taken at a certain moment in time. In text mining, the material that is used as input consists of written pieces of text transformed into data. The data is formatted as a word + number, a feature + weight, a key + value pair, or as a linguist would call it: a subject + predicate combination. This is the material building block with which text mining outcomes are constructed. Immaterial building blocks are: points of departure, source-selection, noise-reduction, and test-techniques. All these elements affect the vector / the model / the algorithm.
* what is the position of the key-value pair in the text mining process?
** description of the 5 KDD steps
* what side effects does it have to describe a profile in key-value pairs?
** using written words is already a representation: the word represents the meaning/thought/intention
** in text mining, written text is aimed to be interpreted from a reader's perspective (as opposed to the writer's intention)
** key-value pair is a representational format: the value represents the key
</div>
==2 - context and history==
* where could the key-value format be traced back from?
** in a tradition of systematizing language?
*** Are these aims not relying too much on a rational tradition? Austin's 'Speech Act theory' & Heidegger's 'Dasein' and opinion that we “should not split up the object from the predicate”
** in a tradition of data-processing?
** in a tradition of AI/NLP projects like ELIZA & SHRDLU (?)?
Text mining seems to be a rather brutal way to deal with the aim to process natural language into useful information. To reflect on this brutality, tracing back a longer tradition of natural language processing could be useful. Hopefully this will be a way to create some distance to the hurricanes of data that are mainly known as 'big', 'raw' or 'mined' these days.
By looking at projects from the past in the field of NLP (being part of AI/computer science departments), this chapter will attempt to highlight their attitudes towards written text.
<div style="color:gray;">(The systemization of language through key-value pairs follows a Western tradition that is based on the convention that there is an individual that perceives, and there is an outside world that contains its own natural truth. The linguist Austin rather shows that language is merely a speech act, happening as a social act. In these speech acts there is no such external objective meaning of the words we use in language; that meaning only exists in social relations. Heidegger goes even further, and says that we should not split up the object from the predicate. For example: while 'hammering', the person that hammers is not regarding the hammer in a reflective sense. The person is in the moment of using the hammer to achieve something. The only moment when the person would be confronted with a representational sense of the hammer is when the hammer breaks down. It is at that moment that the person will need to fix the hammer, and learn a bit more about what a hammer 'is'.)</div>
==3 - circularity (reasoning)==
A text mining process does not aim to 'find' out whether assumptions can be confirmed; it is rather engineered to confirm them.
*  text mining & an iterating workflow
** Pattern's workflow [[User:Manetta/i-could-have-written-that/knowledge-discovery-workflow | more here]]
** IBM's 'iterating' graph
*  heliocentric/geocentric diagrams to understand the universe → complexity confirms the input (here: the perception of stars and their positions) Could this be the case in mining practices?




Line 89: Line 48:


==bibliography (five key texts)==
* Matthew Fuller - short presentation of the poem: Blue Notebook #10 / The Red-haired Man, during Ideographies of Knowledge, Mundaneum, Mons (Oct. 2015); [http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/the-red-haired-man/the-red-haired-man.html annotations]
* Joseph Weizenbaum - Computer Power and Human Reason: From Judgement to Calculation (1976);
* Winograd + Flores - Understanding Computers & Cognition (1987);
Line 95: Line 53:
* Antoinette Rouvroy - All Watched Over By Algorithms - Transmediale (Jan. 2015); [http://pzwart1.wdka.hro.nl/~manetta/annotations/html/events%2btalks/transmediale_all-watched-over-by-algorithms_2015.html &rarr; annotations]
* The Journal of Typographic Research - OCR-B: A Standardized Character for Optical Recognition this article (V1N2) (1967); [http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/automatic-reading-machines/automatic-reading-machines.html &rarr; abstract]
{{#widget:YouTube|id=JFgsdzikVZU}}


==annotations==
Line 127: Line 87:
==other==
[[User:Manetta/thesis/thesis-outline-nlp | outline-thesis (2) &rarr; NLP]]
------------------------------
[[User:Manetta/thesis/thesis-in-progress | thesis in progress (overview)]]
[[User:Manetta/thesis/chapter-intro | intro &+]]
[[User:Manetta/thesis/chapter-1 | chapter 1]]
[[User:Manetta/thesis/chapter-2 | chapter 2]]
[[User:Manetta/thesis/chapter-3 | chapter 3]]

Latest revision as of 15:09, 30 April 2016

outline - i could have written that

intro

text analytics < > systemization of language

This text originates from an interest in the systemization of language that is needed for computer software to be able to 'understand' and process written language.

vocabulary

  • buzzwords (machine learning, big data, data mining) (ref to Florian Cramer)
  • metaphor (too much ???)
  • one of the five KDD steps

problematic situation

...

Text mining seems to be a rather brutal way to deal with the aim to process natural language into useful information. To reflect on this brutality, tracing back a longer tradition of natural language processing could be useful. Hopefully this will be a way to create some distance to the hurricanes of data that are mainly known as 'big', 'raw' or 'mined' these days.

audience

This thesis will aim for an audience that is interested in an alternative perspective on buzzwords like 'big data' and 'data-mining'. Also, this thesis will (hopefully!) offer a view from a computer-vision side: how software is written to understand the non-computer world of written text.

hypothesis

The results of data-mining software are not mined, results are constructed.

chapter 1: on what basis? three settings to highlight differences in text analytical ideologies

  • setting 1: PhD candidate's thesis defence, Faculty of Economics, Erasmus University Rotterdam
  • setting 2: Lyle Ungar's TED Talk, World Well Being Project, Faculty of Psychology, University of Pennsylvania
  • setting 3: Guy de Pauw's introduction on text mining software, CLiPS, Faculty of Arts & Philosophy, Computational Linguistics & Psycholinguistics department, University of Antwerp

chapter 2: deriving information from written text → the material form of language

  • statistical text analytics is not 'read-only', it's writing
    • to extract? → to derive
  • written language as source material
    • analogy to typography, dealing with the optical materiality of words/sentences/text
    • text analytics dealing with the quantifiable and structural materiality of words/sentences/text
      • word-counts
      • word-order/structure
  • what do these material analyses represent?
    • key-value format (?)
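
The two material analyses listed above (word-counts and word-order) can be sketched in a few lines of Python. This is only a minimal illustration under assumptions of my own: the function names `word_counts` and `bigram_counts`, the sample sentence, and the choice to lowercase and drop punctuation are hypothetical and are not taken from any particular text mining package. The point is that a written sentence ends up as word + number (key + value) pairs, and that even this small step already involves decisions about what to throw away.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed simplification: lowercase everything and keep only runs of
    # letters and apostrophes. Punctuation and casing are discarded here.
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text):
    # Word-counts: each word (key) is paired with how often it occurs (value).
    return Counter(tokenize(text))


def bigram_counts(text):
    # Word-order/structure: count pairs of neighbouring words, so that a
    # small part of the original order survives in the key.
    words = tokenize(text)
    return Counter(zip(words, words[1:]))


sample = "the data speaks, the data is written"  # hypothetical sample text

for key, value in word_counts(sample).items():
    print(key, value)

for key, value in bigram_counts(sample).items():
    print(key, value)
```

Changing the tokenizing rule (for example, keeping punctuation or casing) changes the resulting keys and values, which is one way to see that the pairs are derived rather than simply extracted.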

chapter 3: information extraction / text categorization. diving into the software!

  • unsupervised
  • supervised
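
The difference between the two modes can be sketched as follows. This is a toy example under stated assumptions, not the method of any specific software: the training sentences, the labels 'positive' and 'negative', and the helper names `train`, `classify` and `jaccard` are all hypothetical. In the supervised case the categories are written into the model beforehand through labelled examples; in the unsupervised case texts are only compared to each other by the words they share.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed simplification: lowercase and split on whitespace.
    return text.lower().split()


def train(labelled_texts):
    # Supervised: count how often each word occurs under each given label.
    weights = defaultdict(Counter)
    for text, label in labelled_texts:
        weights[label].update(tokenize(text))
    return weights


def classify(weights, text):
    # Score each label by summing the training counts of the words in the
    # text, then return the label with the highest score.
    scores = {
        label: sum(counts[word] for word in tokenize(text))
        for label, counts in weights.items()
    }
    return max(scores, key=scores.get)


def jaccard(text_a, text_b):
    # Unsupervised: similarity of two texts as shared words divided by all
    # words, without any labels being involved.
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical labelled examples; the labels are chosen by whoever trains.
training = [
    ("i feel great today", "positive"),
    ("what a great day", "positive"),
    ("i feel awful today", "negative"),
    ("what an awful day", "negative"),
]

model = train(training)
print(classify(model, "a great great morning"))
print(jaccard("i feel great today", "i feel awful today"))
```

The classifier can only ever answer with one of the labels that appeared in the training material, which relates to the hypothesis above: the result is constructed from the selected examples rather than mined from the new text.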


material

bibliography (five key texts)

  • Joseph Weizenbaum - Computer Power and Human Reason: From Judgement to Calculation (1976);
  • Winograd + Flores - Understanding Computers & Cognition (1987);
  • Vilem Flusser - Towards a Philosophy of Photography (1983); → annotations
  • Antoinette Rouvroy - All Watched Over By Algorithms - Transmediale (Jan. 2015); → annotations
  • The Journal of Typographic Research - OCR-B: A Standardized Character for Optical Recognition (V1N2) (1967); → abstract

annotations

  • Alan Turing - Computing Machinery and Intelligence (1950)
  • The Journal of Typographic Research - OCR-B: A Standardized Character for Optical Recognition (V1N2) (1967); → abstract
  • Ted Nelson - Computer Lib & Dream Machines (1974);
  • Joseph Weizenbaum - Computer Power and Human Reason (1976); → annotations
  • Walter J. Ong - Orality and Literacy (1982);
  • Vilem Flusser - Towards a Philosophy of Photography (1983); → annotations
  • Christiane Fellbaum - WordNet, an Electronic Lexical Database (1998);
  • Charles Petzold - Code, the hidden languages and inner structures of computer hardware and software (2000); → annotations
  • John Hopcroft, Rajeev Motwani, Jeffrey Ullman - Introduction to Automata Theory, Languages, and Computation (2001);
  • James Gleick - The Information, a History, a Theory, a Flood (2008); → annotations
  • Matthew Fuller - Software Studies. A lexicon (2008);
  • Marissa Mayer - the physics of data, lecture (2009); → annotations
  • Matthew Fuller & Andrew Goffey - Evil Media (2012); → annotations
  • Antoinette Rouvroy - All Watched Over By Algorithms - Transmediale (Jan. 2015); → annotations
  • Benjamin Bratton - Outing A.I., Beyond the Turing test (Feb. 2015) → annotations
  • Ramon Amaro - Colossal Data and Black Futures, lecture (Oct. 2015); → annotations
  • Benjamin Bratton - On A.I. and Cities : Platform Design, Algorithmic Perception, and Urban Geopolitics (Nov. 2015);

currently working on

  • terminology: data 'mining'
  • Knowledge Discovery in Data (KDD) in the wild, problem formulations
  • KDD, applications
  • KDD, workflow
  • text-processing: simplification
  • list of data mining parties

other

outline-thesis (2) → NLP


thesis in progress (overview)

intro &+

chapter 1

chapter 2

chapter 3