<div style="width:100%;max-width:800px;">
=outline - i could have written that=
 
==intro==
===NLP===
NLP (natural language processing) is a category of software that is concerned with the interaction between human language and machine language. It is mainly present in the fields of computer science, artificial intelligence and computational linguistics. On a daily basis people deal with services that contain NLP techniques: translation engines, search engines, speech recognition, auto-correction, chatbots, OCR (optical character recognition), license plate detection, data-mining. There is a range of different expectations of NLP systems, but full coverage of natural language is unlikely. With 'i-could-have-written-that' I would like to place NLP software central, not only as technology but also as a cultural object: to look at the inner workings of these technologies, to reveal in which way NLP software is constructed to understand human language, which technical and social mechanisms systemize our natural language in order for it to be understood by a computer, and what side-effects these techniques have.
===knowledge discovery in data (data-mining)===
For the occasion of this year's graduation project, I would like to focus on the practice of text mining, a subfield of so-called 'data mining'.


===text analytics < > systemization of language===
This text originates from an interest in the systemization of language that is needed for computer software to be able to 'understand' and process written language.


===vocabulary===
Text mining is part of an analytical practice of searching for patterns in text following a "data-driven approach", and assigning these patterns to (predefined) profiles. It is part of a bigger information-construction process (called Knowledge Discovery in Data, KDD) which involves source selection, data creation, simplification, translation into vectors, and testing techniques (a small sketch of these steps follows the list below).
* buzzwords (machine learning, big data, data mining) (ref to Florian Cramer)
* metaphor (too much ???)
* one of the five KDD steps
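To make the KDD steps named above a bit more concrete, here is a minimal, illustrative sketch in Python. The toy sources, the 'profile' and every function name are my own assumptions, not taken from any existing mining package; it only walks a few raw texts through source selection, simplification, translation into vectors, and a trivial test against a predefined profile.

<source lang="python">
# A toy walk through the KDD steps described above: source selection,
# data creation, simplification, translation into vectors, and testing.
# All names and the example 'profile' are illustrative assumptions.

import re
from collections import Counter

# 1. source selection: decide which documents enter the process at all
sources = {
    "blog":  "Data is the new oil, or so the slogan goes.",
    "tweet": "mining data all day #bigdata",
    "spam":  "WIN $$$ NOW",
}
selected = {name: text for name, text in sources.items() if name != "spam"}

# 2. data creation + simplification: lowercase, strip punctuation, tokenize
def simplify(text):
    return re.findall(r"[a-z]+", text.lower())

tokens = {name: simplify(text) for name, text in selected.items()}

# 3. translation into vectors: each document becomes word -> count pairs
vectors = {name: Counter(words) for name, words in tokens.items()}

# 4. 'testing': match every vector against a predefined profile
profile = {"data", "mining", "big"}
for name, vector in vectors.items():
    score = sum(count for word, count in vector.items() if word in profile)
    print(name, dict(vector), "-> score against profile:", score)
</source>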


== context ==
===problematic situation===
Text mining is a politically sensitive technology, closely related to the surveillance and privacy discussions around 'big data'. The technique sits in the middle of tense debates about capturing people's behaviour for security reasons, while it affects the privacy of a lot of people, accompanied by an unpleasant controlling force that seems to be omnipresent. After the disclosure of the NSA's data capturing program by Edward Snowden in 2013, a wider public became aware of the silent data collecting activities of a governmental agency, for example on [http://www.popularmechanics.com/military/a9465/nsa-data-mining-how-it-works-15910146/ phone-metadata]. Since October 2014, UK copyright law contains special exceptions that make [https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375954/Research.pdf text mining practices possible] on intellectual property for non-commercial use. Problematic is the skewed balance between data producers and data analysts, also framed as [http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2709498 'data colonialism'], and the governing role this gives to data analytics, for example by constructing your search-results list according to your data profile.
...


The magical effect of text mining results, caused by the difficulty of understanding how these results are constructed, makes it hard to formulate an opinion about text mining techniques. It even makes it hard to formulate what the problem exactly is, as many people tend to agree with the calculations and word-counts that are seemingly executed. "What exactly is the problem?" and "This is the data that speaks, right?" are questions that need to be challenged in order to have a conversation about text mining techniques at all.
Text mining seems to be a rather brutal way to process natural language into useful information. To reflect on this brutality, it could be useful to trace back a longer tradition of natural language processing. Hopefully this will be a way to create some distance from the hurricanes of data that are mainly known as 'big', 'raw' or 'mined' these days.


This thesis will attempt to show the subjectivity that is present in text mining software, by zooming in on the construction of so-called 'vectors', a central element in the process where text and numbers meet.
===audience===
This thesis aims at an audience that is interested in an alternative perspective on buzzwords like 'big data' and 'data-mining'. Also, this thesis will (hopefully!) offer a view from the computer's side: how software is written to 'understand' the non-computer world of written text.


==hypothesis==
The results of data-mining software are not mined, results are constructed. <br>


== reading technique: a 'vector' ==
Data is often regarded as a material that can easily be extracted from the web, a little 'truth-snapshot' taken at a certain time and moment. In text mining, the material that is used as input is written text transformed into data. The data consists of a word + a number, a feature + a weight, a key + value pair, or, as a linguist would call it, a subject + predicate combination. These are the building blocks from which text mining outcomes are constructed. Immaterial building blocks are: points of departure, source selection, noise reduction, and test techniques. All these elements affect the vector. (In that way it could be understood as a cybernetic system of control.)

A vector implies the moment of representing a word with a number. This number can represent a word count, ..., ... (a small sketch follows below).
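As a minimal illustration of such word + number pairs, the sketch below turns one invented sentence into a count vector. The plain word count is an assumption of mine, standing in for whatever feature weighting a given text mining package actually uses.

<source lang="python">
from collections import Counter

sentence = "the data speaks for itself or so the data is said to speak"

# the material building block: each word is paired with a number,
# here a plain word count (other systems use weights instead)
vector = Counter(sentence.split())

print(vector.most_common(3))   # [('the', 2), ('data', 2), ...]
</source>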
==chapter 1: on what basis? three settings to highlight differences in text analytical ideologies==
* setting 1: PhD candidate's thesis defence, Faculty of Economics, Erasmus University Rotterdam
* setting 2: Lyle Ungar's TED talk, World Well Being Project, Faculty of Psychology, University of Pennsylvania
* setting 3: Guy de Pauw's introduction on text mining software, CLiPS, Faculty of Arts & Philosophy, Computational Linguistics & Psycholinguistics department, University of Antwerp


==chapter 2: deriving information from written text &rarr; the material form of language==
* statistical text analytics is not 'read-only', it's writing
** to extract? &rarr; to derive
* written language as source material
** analogy to typography, dealing with the optical materiality of words/sentences/text
** text analytics dealing with the quantifiable and structural materiality of words/sentences/text (see the sketch below this list)
*** word-counts
*** word-order/structure
* what do these material analyses represent?
** key-value format (?)
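As a toy illustration of the two material readings listed above (word-counts versus word-order), the sketch below compares two invented sentences that share the same bag-of-words but differ in their bigrams. The example and the choice of bigrams are my own assumptions, not taken from a specific package.

<source lang="python">
from collections import Counter

a = "the dog bites the man"
b = "the man bites the dog"

def bag_of_words(text):
    # quantifiable materiality: word-counts only, word-order is thrown away
    return Counter(text.split())

def bigrams(text):
    # structural materiality: word-order survives as pairs of neighbouring words
    words = text.split()
    return Counter(zip(words, words[1:]))

print(bag_of_words(a) == bag_of_words(b))  # True: identical counts
print(bigrams(a) == bigrams(b))            # False: different structure
</source>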


== systemization of language ==
The systemization of language is needed to fulfil the aim of developing software that processes large amounts of text and is able to 'read' its content. Where could this aim come from? The linguist Austin shows that language is above all a speech act, happening as a social act. In these speech acts there is no external, objective meaning of the words we use in language. Heidegger goes even further, and says that while 'hammering', the person who hammers is not regarding the hammer in a reflective sense. The person is in the moment of using the hammer to achieve something. The only moment the person is confronted with a representational sense of the hammer is when the hammer breaks down. It is at that moment that the person learns a bit more about what a hammer 'is'.

==chapter 3: information extraction / text categorization. diving into the software!==
* unsupervised
* supervised
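The distinction in the two bullets above can be made concrete with a small contrast, under the assumption that 'supervised' here means learning categories from labelled example texts, while 'unsupervised' means grouping texts without any labels. The categories, example texts and the overlap measure below are invented for illustration, not taken from the software the chapter will discuss.

<source lang="python">
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def overlap(v1, v2):
    # similarity = number of shared word occurrences
    return sum(min(v1[w], v2[w]) for w in v1)

# supervised: labelled training texts define the categories in advance
training = {
    "sports":   "the match ended after the keeper saved the final shot",
    "politics": "the minister defended the new law in parliament today",
}
new_text = "parliament votes on the law proposed by the minister"
scores = {label: overlap(vec(new_text), vec(text)) for label, text in training.items()}
print("supervised label:", max(scores, key=scores.get))   # -> politics

# unsupervised: no labels, texts are simply paired by mutual similarity
docs = [
    "the keeper saved the shot",
    "the minister proposed a law",
    "a law passed in parliament",
    "the final shot won the match",
]
for doc in docs:
    others = [other for other in docs if other != doc]
    closest = max(others, key=lambda other: overlap(vec(doc), vec(other)))
    print(repr(doc), "->", repr(closest))
</source>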






=material=


==bibliography (five key texts)==
* Joseph Weizenbaum - Computer Power and Human Reason: From Judgement to Calculation (1976);
* Winograd + Flores - Understanding Computers & Cognition (1987);
* Vilem Flusser - Towards a Philosophy of Photography (1983); [http://pzwart1.wdka.hro.nl/~manetta/annotations/html/txt/vilem-flusser_towards-a-philosophy-of-photography.html &rarr; annotations]
* Antoinette Rouvroy - All Watched Over By Algorithms - Transmediale (Jan. 2015); [http://pzwart1.wdka.hro.nl/~manetta/annotations/html/events%2btalks/transmediale_all-watched-over-by-algorithms_2015.html &rarr; annotations]
* The Journal of Typographic Research - OCR-B: A Standardized Character for Optical Recognition (V1N2) (1967); [http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/automatic-reading-machines/automatic-reading-machines.html &rarr; abstract]


{{#widget:YouTube|id=JFgsdzikVZU}}


 
 
==project & thesis (merge)==
<small>voice: accessible to a wider public </small><br>
<small>needed: problem formulations that connect with day-to-day life </small>

As 'i-could-have-written-that' is driven by textual research, it would feel quite natural to merge the practical and written (reflective) elements of the graduation procedure into one project. Also, the eventual format I have in mind at the moment, a publication series, could bring the two together. Next to written reflections on the hypothesis of constructed results, I would like to work on hands-on prototypes with text-mining software.

As a work method, I would like to isolate and analyse different data-mining elements on which to test the hypothesis. The elements selected so far focus on: terminology (metaphors + history), software (data construction + ... ), and presentation of results.
 
==data mining elements==
[[File:Text-mining-technical-process.png|right|thumbnail|text-mining software Pattern, workflow diagram]]
* terminology ('mining', 'data')
** 'mining' &rarr; from 'mining' minerals to 'mining' data; [[User:Manetta/i-could-have-written-that/from-mining-minerals-to-mining-data | (wiki-page)]]
*** 'data mining' & mining natural resources
*** 'data mining' & archeology
*** 'data mining' & writing
** 'data' &rarr; data as autonomous entity; from: information, to: data science
* text-processing
** from: able to check results with senses (OCR), to: intuition (data-mining)
** parsing, how text is treated: as n-grams, chunks, bag-of-words, characters
* workflow of mining software (e.g. Pattern, Weka); (software workflow diagram) & circularity
* Knowledge Discovery in Data (KDD) workflow & circularity
** prototype: how different aims 'read' the data according to their perspective ... (recognizing patterns in a game of chance)
* presentation of results
 
===theory===
* solutionism & techno optimism
* big-data, machine learning & data-mining criticism
 
=research material=
[http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/ &rarr; filesystem interface, collecting research related material] [[User:Manetta/i-could-have-written-that/filesystem-interface-related-material | (+ about the workflow)]]<br>
[[User:Manetta/i-could-have-written-that | &rarr; wikipage for 'i-could-have-written-that' (list of prototypes & inquiries)]] <br>
[[User:Manetta/i-could-have-written-that/little-glossary | &rarr; little glossary]]<br>
 
===mining as ideology===
[[User:Manetta/i-could-have-written-that/from-mining-minerals-to-mining-data | * from mining minerals to mining data]]<br>
 
'''anthropomorphism'''
 
[[User:Manetta/i-could-have-written-that/anthropomorphic-qualities | * anthropomorphic qualities of a computer (?)]]<br>
[[User:Manetta/i-could-have-written-that/the-data-apparatus | * the photographic apparatus &rarr; the data apparatus (annotations)]] <br>
[http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/joseph-s_questions/joseph-s_questions.html * Joseph's (Weizenbaum) questions on Computer Power and Human Reason]<br>
 
===text processing===
[http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/semantic-math-averaging/semantic-math-averaging.html * semantic math: averaging polarity rates in Pattern (text mining software package)] (toy sketch below this list)<br>
[[User:Manetta/i-could-have-written-that/wordclouds | * notes on wordclouds]]<br>
[http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/automatic-reading-machines/automatic-reading-machines.html * automatic reading machines; from encoding-decoding to constructed-truths]<br>
[http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/wordnet-skeleton/wordnet-skeleton.html * index of WordNet 3.0 (2006)]<br>
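The 'semantic math' item above can be illustrated with a minimal sketch: the polarity of a sentence taken as the average of the polarity rates of its individual words. The mini-lexicon and its scores are invented for the example and are not Pattern's actual lexicon or API.

<source lang="python">
# toy 'semantic math': sentence polarity as the average polarity of its words;
# the lexicon below is an invented stand-in for a real sentiment lexicon
polarity = {"good": 0.7, "nice": 0.6, "bad": -0.7, "awful": -0.8}

def sentence_polarity(sentence):
    scores = [polarity[word] for word in sentence.lower().split() if word in polarity]
    return sum(scores) / len(scores) if scores else 0.0

print(round(sentence_polarity("a good film with an awful ending"), 2))  # -0.05
</source>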
 
===data as autonomous entity===
[http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/knowlegde-driven-by-the-data/knowlegde-driven-by-the-data.html * knowledge driven by data - ''whenever I fire a linguist, the results improve'']<br>
 
===other===
[http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/i-am-sorry-but-these-are-the-words-laughter/i-am-sorry-but-these-are-the-words-laughter.html * (laughter) - ''it's embarrassing but these are the words'']<br>
[[User:Manetta/i-could-have-written-that/syntactic-view | * call for a syntactic view; Florian Cramer & Benjamin Bratton (text)]] <br>
[[User:Manetta/i-could-have-written-that/sentiment-analysis-phd-presentation | * EUR PhD presentation 'Sentiment Analysis of Text Guided by Semantics and Structure' (13-11-2015) ]]<br>
[http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/roget-s_thesaurus-of-english-words-and-phrases/roget-s_thesaurus-of-english-words-and-phrases.html * index of Roget's thesaurus (1805)]<br>
[http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/classification_what-happened_roget---wordnet/classification_what-happened_roget---wordnet.html * comparing the classification of the word 'information' in Roget's Thesaurus (1911) vs. WordNet 3.0 (2006)]<br>
 
 
=annotations=
* Alan Turing - Computing Machinery and Intelligence (1950)
* The Journal of Typographic Research - OCR-B: A Standardized Character for Optical Recognition (V1N2) (1967); [http://pzwart1.wdka.hro.nl/~manetta/i-could-have-written-that/elements/automatic-reading-machines/automatic-reading-machines.html &rarr; abstract]
* Ted Nelson - Computer Lib & Dream Machines (1974);
* Joseph Weizenbaum - Computer Power and Human Reason (1976); &rarr; annotations
* Walter J. Ong - Orality and Literacy (1982);
* Vilem Flusser - Towards a Philosophy of Photography (1983); [http://pzwart1.wdka.hro.nl/~manetta/annotations/html/txt/vilem-flusser_towards-a-philosophy-of-photography.html &rarr; annotations]
* Christiane Fellbaum - WordNet, an Electronic Lexical Database (1998);
* Charles Petzold - Code, the Hidden Languages and Inner Structures of Computer Hardware and Software (2000); &rarr; annotations
* John Hopcroft, Rajeev Motwani, Jeffrey Ullman - Introduction to Automata Theory, Languages, and Computation (2001);
* Florian Cramer - Language (2008); [http://pzwart1.wdka.hro.nl/~manetta/annotations/html/txt/florian-cramer_language.html &rarr; annotations]
* James Gleick - The Information, a History, a Theory, a Flood (2008); &rarr; annotations
* Matthew Fuller - Software Studies. A Lexicon (2008);
* Marissa Mayer - The Physics of Data, lecture (2009); &rarr; annotations
* Matthew Fuller & Andrew Goffey - Evil Media (2012); &rarr; annotations
* Antoinette Rouvroy - All Watched Over By Algorithms - Transmediale (Jan. 2015); [http://pzwart1.wdka.hro.nl/~manetta/annotations/html/events%2btalks/transmediale_all-watched-over-by-algorithms_2015.html &rarr; annotations]
* Benjamin Bratton - Outing A.I., Beyond the Turing Test (Feb. 2015); &rarr; annotations
* Ramon Amaro - Colossal Data and Black Futures, lecture (Oct. 2015); &rarr; annotations
* Benjamin Bratton - [https://vimeo.com/145288035 On A.I. and Cities : Platform Design, Algorithmic Perception, and Urban Geopolitics] (Nov. 2015);


==currently working on==
[[User:Manetta/i-could-have-written-that/from-mining-minerals-to-mining-data | * terminology: data 'mining']]<br>
[[User:Manetta/i-could-have-written-that/data-mining-in-the-wild | * ''Knowledge Discovery in Data'' (KDD) in the wild, problem formulations]]<br>
[[User:Manetta/i-could-have-written-that/kdd-applications | * ''KDD'', applications]]<br>
[[User:Manetta/i-could-have-written-that/knowledge-discovery-workflow | * ''KDD'', workflow]]<br>
[[User:Manetta/i-could-have-written-that/text-processing/simplification | * text-processing: simplification]]<br>
[[User:Manetta/i-could-have-written-that/data-mining-parties | * list of data mining parties]]<br>
==other==
[[User:Manetta/thesis/thesis-outline-nlp | outline-thesis (2) &rarr; NLP]]


[[User:Manetta/thesis/thesis-in-progress | thesis in progress (overview)]]

[[User:Manetta/thesis/chapter-intro | intro &+]]

[[User:Manetta/thesis/chapter-1 | chapter 1]]

[[User:Manetta/thesis/chapter-2 | chapter 2]]

[[User:Manetta/thesis/chapter-3 | chapter 3]]
