User:Manetta/thesis/chapter-1: Difference between revisions

From XPUB & Lens-Based wiki
<div style="width:750px;">
<div style="width:750px;">
__TOC__
=i could have written that - chapter 1=


==on what basis? three settings to highlight differences in text analytical ideologies==
<div style="width:350px;">


===Setting 1===
[[File:EUR-PhD-defence-sentiment-mining.JPG|thumb|left]]
It is a wet Friday afternoon in mid-November 2015. A woman in traditional academic dress enters the lecture room. The people in the audience stand up from their seats. The woman carries a stick with bells. They tinkle softly. It seems to be her way of telling everyone to be silent and stay focused on what is coming. A group of two women and seven men dressed in togas follows her. They walk to their seats behind the jury table. The doctoral candidate from the Economics department starts his defense. He introduces his area of research with a short description of the increasing amount of information that is published on the internet these days in the form of text: text that could be used to extract information about the reputation of a company. It is important for decision makers to know how the public feels about their products and services. Online written material, such as reviews, is a very useful source for companies to extract that information from. How could this be done? The candidate illustrates his methodology with an image of a branch with multiple leaves. When looking at the leaves, one could order them by color or by shape. Such ordering techniques can be applied to written language as well, by analyzing and sorting words. The candidate's topic of research has been 'sentiment analysis in written texts'. This is nothing new. Sentiment analysis is a common tool in the text mining field. The candidate's aim is to improve this technique. He proposes to detect emoticons as sentiment, and to give more weight to the more important segments of a sentence.


One of the professors opens the discussion in a critical tone. He asks the candidate for his definition of the word 'sentiment'. The candidate replies that sentiment is what people intend to convey. There is the sentiment that the reader perceives, and there is the sentiment that the writer conveys. In the case of reviews, sentiment is a judgment. The professor states that the candidate only used the values '-1' and '+1' to describe sentiment in his thesis, which is not a definition. The professor continues by asking whether the candidate can offer a theory on which the thesis is based. Again there is no answer that fulfills the professor's request. The professor claims that the candidate's thesis only presents numbers, no definitions.


Another professor continues and asks about the 'neutral backbone' that is used in the research to validate the sentiment of certain words. Did the candidate collaborate with psychologists, for example? The candidate replies that he collaborated with a company that manually rated the sentiment values of words and sentences. He cannot describe how that annotation process was carried out. The professor stresses that an external backbone is needed in order to be able to give results, which brings him to his next question. The professor counted 6000 calculations that had been done to confirm the candidate's hypothesis. This 'hypothesis testing' phenomenon is a recurring element in the arguments of the thesis. The candidate is asked whether he wasn't over-enthusiastic about his results.
</div>


But the jury must also admit that it is quite an achievement to write a doctoral thesis on the topic of text mining at a university that has no department of linguistics and none of computer science either. The candidate was located at the Economics department, under the 'Erasmus Research Institute of Management'. When the candidate was asked about his plans to fix the gaps in his thesis, though, he replied that he already had a job in the business, and that rewriting his thesis would be neither a priority nor his primary interest.




===Setting 2===
2:11 Now let's start with something that's relatively clear,
2:14 and let's see if it makes sense.  
2:15 See the words that are most typical, most discriminative, most predictive,
2:21 of being female.
2:23 (Laughter)
2:30 Yeah, it's a little bit embarrassing, I'm sorry, but I didn't make this up!
2:34 It's very cliché, but these are the words.


[[File:TED-talk-screenshot_Lyle-Unger-World-Well-Being-Project_predicting-heartdiseases-using-Twitter.png|thumb|left]]


The video reaches 2:11 when Lyle Ungar starts his introduction to the first text mining results that he will present to his TED audience tonight. Luckily enough he can start with something that is relatively clear: the words that are most typical, most discriminative, most predictive, of being female. Nothing too complicated to start with. He proposes to team up with his audience, to see together if the outcomes make sense. Only 10 seconds later Lyle hits the button on the TED slide remote controller. A bright shape appears, reflected in his glasses. The audience slowly starts to titter. Lyle looks up to face his audience. He frowns, turns his head, walks a few steps to the right and sighs theatrically and somewhat too loudly. While Lyle's posture speaks the language of pained embarrassment, a white slide with colorful words appears on the screen. '<3' is typeset in the largest font size, followed by 'excited', 'shopping', 'love you' and 'my hair', surrounded by some 50 other words that together form the shape of a cloud. '(Laughter)' appears in the subtitles. The audience seems to recognize the words, and responds to them with stifled laughter. Is it the term 'shopping', appearing so big, that is funny? Because it confirms a stereotype? Or is it surprising to see what extreme expressions appear to be typical of being female? Lyle had seen it coming, and quickly excuses himself for the results by saying: 'I didn't make this up! It's very cliché, but these are the words.'


These results are part of a text mining research project of the University of Pennsylvania called the 'World Well Being Project' (WWBP). The project is located at the 'Positive Psychology Center', and aims to measure psychological well-being and physical health by analyzing written language on social media. For the results that Lyle Ungar presented at the TED presentations in Pennsylvania in 2015, a group of 66,000 Facebook users were asked to share their messages and posts with the research group, together with their age and gender. They were also asked to fill in the 'big five personality test', a questionnaire widely used by psychologists to describe human personalities, which returns a value for 'openness', 'conscientiousness', 'extraversion', 'agreeableness', and 'neuroticism'. Text mining is used here as a technique to derive information about Facebook users by connecting their word usage to their age, gender and personality profile.




===Setting 3===
[[File:Cqrrelations_Guy-de-Pauw-CLiPS-Pattern-introduction_small.jpg|thumb|left]]


Guy de Pauw is in the middle of his presentation when he calls text mining a technology of shallow understanding. It is a cold week in mid-January 2015. The room is filled with 40 artists, researchers, designers, activists and students (among others), most of whom are interested in, or working with, free software. Many people sit with laptops on their laps, trying to keep up with the speed and amount of information. Not many people in the audience are familiar with text mining techniques, and Guy's presentation is full of text mining jargon. Making as many notes as possible seems to be the best strategy for the moment. Meanwhile, Guy formulates the fundamental problems that text mining is facing: how to transform text from form to meaning? How to deal with semantics and meaning? And how can a computer 'understand' natural language without any world knowledge? It is telling how much effort Guy takes to show the problematic points in text understanding practices. In one of his next slides, Guy shows an image in which one sentence is interpreted in five different ways. Each version of the sentence 'pretty little girl's school' is illustrated to reveal the different meanings that this short sentence contains. Guy summarizes briefly: “Version one: the pretty school for little girls. Version two: the seemingly little girl and her school. Version three: the beautiful little girl and her school. And so forth.”


[[File:From_CLiPS-presentations-during-Cqrrelations_jan-2015_Brussels-Pretty-little-girls-school.png|center]]


A few minutes earlier, Guy showed an image of two word clouds that represent the words, phrases and topics that most strongly distinguish females from males. '<3', 'shopping' and 'excited' are labeled as most typically 'female'. 'Fuck', 'wishes', and 'he' are presented as most typically 'male'. A little rush of indignation moved through the room. 'But, how?!' You could see question marks rising above many heads. How is this graph constructed? Where does it come from? Guy explained that he is interested in gender detection in a different sense. In the graph, words were connected to themes and topics, from which it is only a small step to speak about 'meaning' and about what females 'are'. Guy's next slide showed that he is more interested in looking at gender in a grammatical way, by analyzing the structures of sentences written by females and comparing these to sentences written by males. Then, all there is to say is: women use more relational language and men more informative language.


Shallow understanding? Guy shows the website 'biograph.be' to illustrate his statement. It is a text mining project in which connections are drawn between the hypotheses of academic papers. The project can be used for prevention, diagnosis or treatment purposes. 'Automated knowledge discovery' is promised to prevent anyone from 'drowning in information'. Guy adds some critical remarks: using this technology in medical contexts “will lead to a fragmentation of the field” as well as to “poor communication between subfields”.


Guy is invited to speak and to introduce the group to a text mining software package. The software is called 'Pattern' and is developed at the University of Antwerp, where Guy is part of the CLiPS research group: 'Computational Linguistics & Psycholinguistics'. Coming from a linguistic background, the CLiPS research group approaches its projects from a structural rather than a statistical angle. This nuance is difficult to grasp when only results are presented. Guy hits a button on his keyboard and his presentation jumps to the next slide. It is a short bullet-pointed overview of linguistic approaches to text understanding for computers. It starts with a knowledge representation approach in the 70s, where sentence structures were described in models that were fed into the computer. It moves via a knowledge-based approach in the 80s, where corpora were created to recognize sentence structures on a word level; word types such as 'noun', 'verb' or 'adjective' functioned as labels, for example. It ends with the period that started in the mid 90s: a statistical and shallow understanding approach. Text understanding became scalable, efficient and robust. Making linguistic models became easier and cheaper. Guy immediately adds a critical remark: is this a phenomenon of scaling up by dumbing down?




[[User:Manetta/thesis/chapter-1 | chapter 1]]


[[User:Manetta/thesis/chapter-2 | chapter 2]]


[[User:Manetta/thesis/chapter-3 | chapter 3]]
</div>
</div>

Latest revision as of 16:11, 30 April 2016

chapter 1 - raw language

There lived a red-haired-man who had no eyes or ears. Neither did he have any hair, so he was called red-haired theoretically.

He couldn't speak, since he didn't have a mouth. Neither did he have a nose. He didn't even have any arms or legs. He had no stomach and he had no back and he had no spine and he had no innards whatsoever. He had nothing at all!

Therefore there's no knowing whom we are even talking about. In fact it's better that we don't say any more about him.

Blue Notebook #10 / The Red-haired Man, by Daniil Ivanovich Kharms, (Translated by Neil Cornwall), written on the 7th of January, 1931; via Matthew Fuller during Ideographies of Knowledge, Mundaneum, 03-10-2015, Mons

a non-man

The poem above describes a man who had no mouth, no nose, and no arms or legs. He had no back, no spine and no innards whatsoever. He actually had nothing at all. Everything that we would need in order to recognize a man as such is absent. The little two-letter word 'no' denies all the characteristics that a normal man would have. And before the short story can continue to speak about the man in question, he has disappeared.

The poem was written by the poet Daniil Kharms in the Soviet Union in early 1931. Kharms wrote this short story in a blue notebook, and he placed it as the tenth piece of writing, in between other short stories covering contrasts such as being rich/poor or smart/stupid. In line with these identity explorations, the red-haired-man echoes a quest for manliness, or even in a more general sense: a quest for being.

The specific descriptions of the red-haired-man introduce him as a figure who starts to appear but disappears at the very same moment. Because the man's qualities are removed while he is being introduced, he becomes a paradox: a non-man. While the non-man is a type of man that contradicts itself, it is impossible to say that this man does not exist at all.


The red-haired-man is stripped of all his characteristics and presented to the reader in a state of full bareness, not in the literal sense of a naked man, but rather in the sense that we can ask ourselves whether the man can even still exist as a fictional character. If all the man's attributes are erased, what is then still left of him that we can speak about?

text as data

This thought experiment makes it possible to shine a light on a material that is stripped of its characteristics in a similar way: text that is used as data.

Ever since the very early emergence of the computer [REF], Turing's article on the Turing test in 1950 [REF] and Weizenbaum's psychotherapist chatbot ELIZA around 1965 [REF], computer scientists have worked on applications that turn natural language into a format that can be processed by a computer. Linguists also became interested in approaching language from a computational perspective, which gave rise to the field of computational linguistics.

In the last few years, roughly since 2012 [REF], the field of natural language processing (NLP) has merged with a particular hype around the possession of data: a specific culture with high aims and a strong belief in statistical computation, which has been labeled with buzzwords like 'big data', 'raw data', 'text mining', 'machine learning' and more recently with the even more mystifying term 'deep learning'. These techniques attempt to find patterns in a set of data, by measuring and calculating similarities between different sets of data. And since social media messages, blog posts, news articles and emails are regarded as a resource for useful and valuable information, data analysts have aimed to measure written text as well. By looking at specific word occurrences and grammatical structures, the data analyst attempts to measure a text's characteristics, like sentiment, violence, certainty, vagueness, subjectivity, factuality, depression, or degrees of irony and sarcasm. But to be able to measure these qualities in a text, the paragraphs, sentences and words need to be transformed into data, which basically means: they need to turn into numbers.

Where and how text is transformed into data is not something that is standardized or fixed. There are many ways to process a document of written text into something that can be measured and calculated.


text processing

  • split (tokenize)

A common way to make a text processable by the computer is to split it up into smaller parts. The split function is a basic function that is included in, for example, the programming language Python. It takes a sentence or text as input (in string format), splits the text on whitespace by default, and returns a list. More comprehensive software contains a stronger 'tokenizer' variant, which is written to also detect punctuation, abbreviations and short linguistic elements such as 's or 'd in English. After the split function, the text is no longer a continuous waterfall of words in sentences in paragraphs, but a list of words. This list is a format which is in line with the nature of a computer, which can now, with the help of for example Python or Libre Office, sort the list in, say, alphabetical order.

The split function makes it possible to process a text in a very basic but powerful way. The text is transformed into data: a list of words.
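
The splitting step described above can be sketched in a few lines of Python. This is only a minimal illustration: the sample sentence, the name `tokenize` and the regular expression are assumptions made for this example, not the tokenizer of any particular text mining package. The pattern is deliberately naive (a quoted word such as 'raw' would be split incorrectly, for instance).

```python
import re

# Sample input, loosely based on the Kharms poem (assumption for this sketch)
text = "There lived a red-haired-man. He'd nothing at all!"

# Basic split: str.split() without arguments splits on runs of whitespace.
# Punctuation stays glued to the words ("all!" remains one chunk).
words = text.split()

# A slightly stronger, hypothetical tokenizer: it separates punctuation and
# short elements such as 's or 'd from the word they are attached to.
TOKEN_PATTERN = re.compile(r"'\w+|\w+(?:-\w+)*|[^\w\s]")

def tokenize(sentence):
    """Return a list of word, clitic and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)

tokens = tokenize(text)

# Once the text is a list, it can be reordered, for example alphabetically.
sorted_words = sorted(words, key=str.lower)

print(words)
print(tokens)
print(sorted_words)
```

Comparing `words` with `tokens` shows the difference between the plain split and the tokenizer variant: in the first list "He'd" and "all!" are single items, in the second they are broken into separate chunks.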


  • count (bag-of-words)

Now that the text has been split, it has changed into words and other chunks of characters. One technique that transforms these into numbers is counting the chunks that have exactly the same form. In this way, the form of the chunks is measured and expressed in a number. A technique that is often used in the field of text mining is a more extensive version of the count function, known as bag-of-words. The bag-of-words function makes it possible to provide a list of words that should be excluded from the counting, such as stopwords or other frequently used phrases. Next to that, the bag-of-words function can count relatively: it takes all the words that appear in a set of documents into account, and calculates the relative uniqueness of a specific word. This technique is often used to create a numerical representation of a text, which can be compared to other texts.

After one of these counting techniques has been applied, the text has again been transformed into another type of data: a list of word-number combinations.
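
As a rough sketch of the two counting techniques described above: the first function counts identical chunks while skipping a stopword list, the second weighs each count by how unique the word is within a set of documents. The text does not name a formula for this 'relative' counting; the sketch assumes one common choice, tf-idf weighting. The stopword list, the function names and the two sample documents are invented for this example.

```python
import math
from collections import Counter

# Hypothetical stopword list; real lists are much longer
STOPWORDS = {"a", "the", "no", "he", "had"}

def bag_of_words(tokens, stopwords=STOPWORDS):
    """Count identical chunks, ignoring case and skipping stopwords."""
    return Counter(t.lower() for t in tokens if t.lower() not in stopwords)

def tf_idf(documents):
    """Weigh each word count by how rare the word is across all documents.

    Assumed formula: (count / words in document) * log(N / documents containing word)
    """
    bags = [bag_of_words(doc) for doc in documents]
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # each document counts once per word
    weighted = []
    for bag in bags:
        total = sum(bag.values())
        weighted.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return weighted

# Two tiny sample documents, already split into lists of words
docs = [
    "he had no back and he had no spine".split(),
    "the text had no layout and the text had no spaces".split(),
]

bags = [bag_of_words(doc) for doc in docs]
weights = tf_idf(docs)

print(bags[1])     # word-number combinations for the second document
print(weights[1])  # the same words, weighted by relative uniqueness
```

In this sketch a word that occurs in every document, such as 'and', receives a weight of zero, while a word that is repeated in only one document, such as 'text', receives the highest weight.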


  • tag (part-of-speech, POS)

from sentence to word-types and syntactical structures & disambiguation (more to come here)

a non-text?

An obvious question here is when a text can still be called a text. What similarities does the data version of a text still hold with the written document? And in what respects do they differ after the transformation into a data object that is created for functional and useful reasons? Or is a process at work that is similar to the one Kharms described with the red-haired man? Does text become a non-text as well?


There was a text which had no layout or sentences. Neither did it have any spaces, so it was called a text theoretically.

It couldn't communicate anything, since it didn't have a message. Neither did it have an intention. It didn't even have any grammar. It had no subjects and it had no objects and it had no punctuation and it had no previous paragraph whatsoever. It had nothing at all!

Therefore there's no knowing what we are even talking about. In fact it's better that we don't say any more about it.

the non-text paradox, no context

A stripped, ordered, split and tagged text loses many of its characteristics and qualities. Where the man has no ears, eyes, arms, legs, hair, back or spine, a parsed text has no sentences, no grammar, no spaces, no punctuation, no page numbers, no reading time, no previous paragraph and no layout; it even has no initial message anymore and no author's intention. There is no subtle tone, no ironic character, no flow of certainty, no rhetorical style, no ambiguous connotation from beautiful rhymes and no shouting voice. Because the chunks of characters are material objects, which are measurable and calculable, they are placed on the highest stage and depicted as the barest primary ingredients, the most basic and pure elements of a text.

raw as static status

These basic elements are often considered to be raw. The idea of raw data is often used in the field of text mining and data analysis to highlight its fundamental function and (too) powerful position. The elements are assumed to be raw in the sense that minerals are raw, or vegetables are raw. But a closer look at this rawness shows that it leaves out an important part: how they actually became raw in the first place. Minerals are considered to be raw when they are still under the ground, ready to be mined and extracted from the earth, but they have developed over many years under the influence of multiple geological processes. Vegetables are considered to be raw in a similar way. When they lie in the supermarket they are raw in the sense of being uncooked, but they have developed from a little seed over a long period of time while absorbing a lot of sunlight, water and organic substances from the ground. The metaphor of rawness is often used to refer to data, but it is built on the ideal that something that is raw is simply already there. Raw data implies the illusion that the data is true in itself, and cannot be argued with, just because it is there.

ideal of rawness

Next to disguising the fact that raw data is created within a specific preceding process, rawness also contains a promise of direct and unmediated access. Antoinette Rouvroy (2015) points to these side effects that occur when data is presented as a raw object. She describes how raw data implies a reference to data as a natural product: “It is the idea that nature will speak by itself. It is the idea that thanks to big data, the world speaks by itself without any: transcription, symbolization, institutional mediation, political mediation or legal mediation”. If data is considered to be raw, it implies that the data can be extracted from its source without any layer of mediation. The directness that rawness connotes constructs (again) an objective and truthful position for the data. It facilitates a strong belief that results only need to be extracted from their source, which can be done without any interference from anything or anyone.

conclusion

In the poem of Daniil Kharms the red-haired-man appeared and disappeared at the same time, and so became a paradox: a non-man. The poem explores the limitations of language by introducing a character that is immediately claimed not to exist (Yankelevich, 2007). The poem becomes an exercise in which the reader is challenged to decide whether the man has actually disappeared. But while a written text slowly transforms into a non-text, by having no reading order, layout, tone or other contextual elements, the text does not disappear. A non-text is no paradox, in the sense that it does not contradict itself in the way the red-haired-man does, whom Daniil Kharms turned into a fictional entity without any characteristics. Although a text loses many of its characteristics after splitting, counting or tagging, it does not disappear. Instead, while the text becomes a non-text and slowly transforms into data, it is slowly encapsulated in the ideology of rawness.


links

thesis in progress (overview)

intro &+

chapter 1

chapter 2

chapter 3