User:Manetta/thesis/chapter-1: Difference between revisions

From XPUB & Lens-Based wiki
<div style="width:750px;">
<div style="width:750px;">
__TOC__
=i could have written that - chapter 1=
=chapter 1 - raw language=


==1.1 - text mining culture==
<div style="width:350px;">
Text mining involves NLP techniques in order to perform as an effective analytical system. By processing written natural language, the technology aims to derive information from a large set of written texts.
''There lived a red-haired-man who had no eyes or ears. Neither did he have any hair, so he was called red-haired theoretically.''


Text mining is a politically sensitive technology, closely related to the surveillance and privacy discussions around 'big data'. The technique sits in the middle of tense discussions about capturing people's behavior for security reasons, a practice that affects the privacy of a lot of people and is accompanied by an unpleasant controlling force that seems to be omnipresent. After Edward Snowden's disclosures of the NSA's data capturing program in 2013, a wider public became aware of the silent data collecting activities carried out by a governmental agency on, for example, phone metadata. Since October 2014, UK legislation has included a special copyright exception that makes text mining practices possible on intellectual property for non-commercial use. What is problematic is the skewed balance between those who produce data and those who analyse it, also framed as 'data colonialism', and the governing role that this gives to data analytics, for example when your list of search results is constructed according to your data profile.
''He couldn't speak, since he didn't have a mouth. Neither did he have a nose. He didn't even have any arms or legs. He had no stomach and he had no back and he had no spine and he had no innards whatsoever. He had nothing at all!''


questions
''Therefore there's no knowing whom we are even talking about. In fact it's better that we don't say any more about him.''
* If text mining is regarded as a writing system, what and where does it write?
** What are the levels of construction in text mining software culture?
*** By considering text mining technology as a reading machine?
*** How does the metaphor of 'mining' affect the process?
*** How much can be based on a cultural side-product (like the text that is commonly used, extracted from e.g. social media)?
 
 
=== text mining as reading machine ===
The seemingly magical effects of text mining results, caused by the hidden presence of data analytics and the multi-layered complexity of text mining software, make it difficult to formulate an opinion about text mining techniques. As text mining is an analytical process, it is often understood as a 'reading' machine.
 
* 'reading' connotations?
 
(A short note here on the use of written text as source material. As Vilem Flusser discussed in his essay 'Towards a Philosophy of Photography' in 1983: “images are not 'denotative' (unambiguous), but 'connotative' (ambiguous) → complexes of symbols, providing space for connotation”. There is much more to say here, but in terms of a short note, text (and data) is not 'denotative' (unambiguous), but 'connotative' (ambiguous). Full of complexes of symbols, providing space for connotation.)
 
* 'machine' connotations?
..., Weizenbaum, about 'machine'
 
If text mining software is regarded as a reading system, it becomes even more difficult to formulate what exactly the problem is. Many people tend to agree with the calculations and word-counts that come out of the software. "What exactly is the problem?" and "This is the data that speaks, right?" are questions that need to be challenged in order to have a conversation about text mining techniques at all. These are examples of what I would like to call 'algorithmic agreeability'.
 
 
=== 'mining' data implies direct non-mediated access to the source===
====mining metaphor====
The term 'data mining' is a fashionable buzzword for the practice of data analytics. Data mining techniques are nothing new; they have been around since the 60s. (??? reference!) But the term became more fashionable with the increasing amounts of data that are published online and made accessible for data analytics in one way or another: a phenomenon that has been called 'big data'. (??? reference!) Applications vary from predicting an author's age to predicting how customers feel about a brand or product.
 
The term 'data mining' is, however, not very accurate, and calling data-analysis software 'data mining software' is actually misleading. First, 'data' is not the object that a 'data mining' process is looking for. While processing data, a 'miner' rather looks for patterns that occur. Would 'pattern mining' be a more accurate term to use?
 
The term contains the metaphor of 'mining', which hints that the software extracts information directly from the data. It implies that there is no mediating layer between the pool of texts and the information that rolls out of the software. But even if it is not 'data' that is mined for, 'patterns' do not suddenly appear out of the big pool of data items either. We will later look into a text mining workflow, and at the steps that affect the outcomes. Using the 'mining' metaphor leads to:
 
* no human responsibility
* outcomes regarded to be objective
* “it's the data that speaks”
 
 
====KDD steps (Knowledge Discovery in Data)====
As 'mining' is not a very accurate description of the technology that analyses text, it is helpful to look at a term used in the academic field: 'Knowledge Discovery in Data' (KDD) (Custers ed. 2013). Here, 'data mining' is only one of five steps in total, and it can only be performed when the others are executed as well.
 
step 1 --> data collection
step 2 --> data preparation
step 3 --> data mining
step 4 --> interpretation
step 5 --> determine actions
 
To this list, I would like to add a few extra points, and rename the steps:
 
* step 0 --> deciding on point of view
  step 1 --> text collection
  step 2 --> text preparation
  step 3 --> construction of data
* step 3a --> creating a vector space
              (turning words into numbers)
* step 3b --> creating a model
              (searching for contrast in the graph)
  step 4 --> interpretation
  step 5 --> determine actions
 
<small>step 1,2,3,4,5 from: Data Mining and Profiling in Large Databases, Bart Custers, Toon Calders, Bart Schermer, and Tal Zarsky (Eds.) (2013) + step 0, 3a, 3b from: #!PATTERN+ project </small>
 
Step 0 highlights the subjective process of deciding which datasets to work with. For example: the World Well Being Project asked 66,000 Facebook users to share their messages, in order to investigate whether the writing style on that platform could reveal something about the psychological profile that the users seemed to belong to.
 
Later I will zoom in on 'grey' moments of text mining software. Most of those moments fall under step 3, the moment of data mining. Making a distinction between the moment that written text is transformed into numbers (3a) and the moment where the 'model' is created (3b) offers some clarity that will help to look at these actions individually.
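
Step 3a, the moment where words turn into numbers, can be made concrete with a small sketch. It is written in plain Python 3 with the standard library only, not with Pattern; the two example texts and all function and variable names are assumptions made for this illustration. Each text becomes a list of counts over a shared vocabulary, and the cosine of the angle between two such lists is one common way to express how similar two texts are.

```python
# Sketch of step 3a: turning texts into vectors over a shared vocabulary.
# The example texts and all names are illustrative assumptions.
import math
from collections import Counter

texts = [
    "a black cat and a white cat",
    "a black dog and a white dog",
]

# Count the words of each text (plain whitespace split, no tokenizer).
counts = [Counter(text.split()) for text in texts]

# The vocabulary is every distinct word across all texts, in a fixed order.
vocabulary = sorted(set(word for c in counts for word in c))

# Each text becomes a list of numbers: one count per vocabulary word.
vectors = [[c[word] for word in vocabulary] for c in counts]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


similarity = cosine_similarity(vectors[0], vectors[1])
print(vocabulary)
print(vectors)
print(similarity)
```

Note how many decisions are already taken in these few lines: what counts as a word, which words form the vocabulary, and which measure is called 'similarity'.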
 
 
=== text mining with a cultural side-product===
Data is derived from blog posts, tweets, news articles, Wikipedia articles, Facebook messages and many other sources. When these media formats are created, a writer is concentrating on formulating a sentence and getting a message across. It is John who tries to tell his Twitter followers that the latest BBC game show is much less interesting this season. John is not consciously creating data. The data is a side product, a product of the current media culture that happens in public space.
 
 
==1.2 – Pattern's* gray spots==
 
(*) Pattern is a text mining software package that includes all the KDD steps mentioned above. The software is written and developed at the University of Antwerp as part of CLiPS, a research center working in the field of Computational Linguistics & Psycholinguistics. It is a basic toolkit that includes the main 'mining' tools (like text crawlers, text parsers, machine learning tools and visualization scripts).
 
 
questions
 
* What are the levels of construction in text mining software itself?
** What gray spots appear when text is processed?
*** What is meant by 'grayness'? How can it be used as an approach to software critique?
*** Text processing: how does written text transform into data?
*** Bag-of-words, or 'document'
*** 'count' → 'weight'
*** trial-and-error, modeling the line
** Testing process
*** how is an algorithm actually only 'right' according to that particular test-set, with its own particularities and exceptions?
*** loops to improve your outcomes
 
 
===idea of 'greyness'?===
...
 
(Fuller & Goffey, 2012)
 
 
===bag-of-words, or 'document'===
written text ←→ text mining ←→ information
The simple act of counting words in a document is the very first act of processing text into numbers. Could the text be called data from now on? Data as a format of written text that is countable and processable for the computer software? The text is now 'ordered', or at least in computational terms. 
>>> from pattern.vector import Document
>>> document = Document("a black cat and a white cat", stopwords=True)
>>> print document.words
{u'a': 2, u'and': 1, u'white': 1, u'black': 1, u'cat': 2}
<small>example of bag-of-words tool, source: pattern-2.6/examples/05-vector/01-document.py</small>
 
For the computer, language is nothing more than a 'bag-of-words' (Murtaugh, 2016). All meaning of the sentences is dropped, and all connection between words is gone. What stays is a ranking of the most commonly used words. Word order is discarded, and words are connected to numbers to make the text 'digestible' for a computer system.
The name 'bag-of-words' brings up an image of a huge bag containing piles of different heights, each pile made of copies of the same word. A top-10 of the most common words could already give insight into what topics are present in a text.
Pattern wraps this technique in its 'document module'. This creates a confusing double use of the term 'document'. While the counted text has been extracted from a document itself (a blog post, tweet or essay), here the bag-of-words set is called a document again, as if nothing actually happened and we are still looking at the actual source.
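
The loss of word order can be shown with a minimal sketch in plain Python 3. It uses the standard library's Counter rather than Pattern's Document, and the scrambled sentence is an invented example: two texts that read very differently end up as exactly the same bag.

```python
# Minimal bag-of-words sketch using the standard library (not Pattern).
from collections import Counter

text = "a black cat and a white cat"
scrambled = "cat a white and cat black a"  # same words, different order

# Split on whitespace and count identical chunks.
bag = Counter(text.split())
scrambled_bag = Counter(scrambled.split())

print(bag)
# The two bags are identical: the order of the words has been discarded.
print(bag == scrambled_bag)
```

From the bag alone there is no way to tell which of the two sentences was the source.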
 
 
==='count' → 'weight'===
Soon after the first (brutal....) act of 'bagging' text into word-counts, the journey towards 'meaningful text' starts again. To compare how similar two documents are (in terms of word use), 'weight' is introduced. It normalizes term frequency by calculating the weight of each word in relation to the total number of words in that document.
 
Document.words stores a dict of (word, count)-items.
Document.vector stores a dict of (word, weight)-items,
where weight is the term frequency normalized (0.0-1.0)
to remove document length bias.
<small>description of a 'document' in Pattern's source code, source: pattern-2.6/pattern/vector/__init__.py</small>
 
>>> from pattern.vector import Document
>>> document = Document("a black cat and a white cat", stopwords=True)
>>> print document.vector.features
[u'a', u'and', u'white', u'black', u'cat']
>>> for feature, weight in document.vector.items():
...     print feature, weight
...
a 0.285714285714
and 0.142857142857
white 0.142857142857
black 0.142857142857
cat 0.285714285714
<small>example to show word-weights, source: pattern-2.6/examples/05-vector/01-document.py </small>
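
The weights in this example can be retraced by hand: the sentence has seven words, 'cat' occurs twice, and 2 / 7 ≈ 0.2857. The sketch below does the same calculation in plain Python 3. It is a reconstruction of the idea of normalized term frequency, not Pattern's actual implementation, and the function name is an assumption.

```python
# Normalized term frequency: each count divided by the total word count.
# A reconstruction of the idea, not Pattern's own code.
from collections import Counter


def term_frequencies(text):
    """Return a dict of (word, weight) with weights between 0.0 and 1.0."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


weights = term_frequencies("a black cat and a white cat")
for word, weight in weights.items():
    print(word, weight)
```

Because every weight is a share of the whole, the weights of one document always add up to 1.0, which is what removes the bias towards longer documents.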
 
 
===trial-and-error, modeling the line===
[[File:Knowledge-discovery-in-data pattern-mining-types.png|border]]
 
(description of how a mining process tries many pattern recognition algorithms, to find the most 'effective')
 
 
''Once the line has been drawn, you can throw all the data-points away,''
''because you have a model this is the moment of truth construction.''
<small>source: Data Mining and Profiling in Large Databases, Bart Custers, Toon Calders, Bart Schermer, and Tal Zarsky (Eds.) (2013)</small>
 
 
===testing data===
how is an algorithm actually only 'right' according to that particular test-set, with its own particularities and exceptions?
 
Option to look at testing techniques, like 'golden standard', 80%/20%, and more....
 
# The only way to really know if you're classifier is working correctly
# is to test it with testing data, see the documentation for Classifier.test().
<small>comment written in Pattern's KNN example, source: pattern-2.6/examples/05-vector/04-knn.py</small>
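
The 80%/20% technique mentioned above can be sketched as follows, in plain Python 3. Everything here is an assumption made for illustration: the ten labelled examples are invented, and the 'classifier' is a deliberately naive word-overlap counter, not Pattern's KNN. The sketch only shows the mechanics: the score an algorithm receives is always a score on this particular held-out 20%.

```python
# Sketch of an 80%/20% train/test split with a naive word-overlap classifier.
# The labelled examples and the classifier are invented for illustration.
import random
from collections import Counter

examples = [
    ("good film", "pos"), ("great show", "pos"), ("nice game", "pos"),
    ("good game", "pos"), ("great film", "pos"),
    ("bad film", "neg"), ("awful show", "neg"), ("boring game", "neg"),
    ("bad game", "neg"), ("awful film", "neg"),
]


def split_examples(items, train_share=0.8, seed=0):
    """Shuffle with a fixed seed, then cut into a training and a test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def train(training_set):
    """Count how often each word occurs under each label."""
    model = {}
    for text, label in training_set:
        model.setdefault(label, Counter()).update(text.split())
    return model


def classify(model, text):
    """Pick the label whose training words overlap most with the text."""
    scores = {label: sum(counts[word] for word in text.split())
              for label, counts in model.items()}
    return max(scores, key=scores.get)


def accuracy(model, test_set):
    """Share of test examples for which the predicted label is correct."""
    correct = sum(1 for text, label in test_set
                  if classify(model, text) == label)
    return correct / len(test_set)


train_set, test_set = split_examples(examples)
model = train(train_set)
score = accuracy(model, test_set)
print(len(train_set), len(test_set), score)
```

Changing the seed changes which two examples end up in the test set, and with it the score: the 'rightness' of the classifier depends on the particularities of the test data.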
 
 
===loops to improve your outcomes===
 
'''the threshold of positivity can be lowered or raised'''
 
# The positive() function returns True if the string's polarity >= threshold.
# The threshold can be lowered or raised, but overall for strings with multiple
# words +0.1 yields the best results.
>>> from pattern.en import positive
>>> print "good:", positive("good", threshold=0.1)
>>> print " bad:", positive("bad")
<small>comment written in Pattern's sentiment example, source: pattern-2.6/examples/03-en/07-sentiment.py</small>
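
How much depends on such a threshold can be shown with a small sketch in plain Python 3. The polarity scores and the helper is_positive() are hypothetical and invented for this illustration; they are not taken from Pattern's wordlist or code. The same three words are judged 'positive' or not, depending only on where the threshold is placed.

```python
# Sketch of a positivity threshold. The scores below are invented;
# is_positive() is a hypothetical helper, not Pattern's positive().
POLARITY = {"good": 0.7, "fine": 0.05, "bad": -0.7}


def is_positive(word, threshold=0.1):
    """True if the word's polarity is greater than or equal to the threshold."""
    return POLARITY.get(word, 0.0) >= threshold


results = {}
for threshold in (0.0, 0.1, 0.8):
    results[threshold] = [w for w in POLARITY if is_positive(w, threshold)]
    print(threshold, results[threshold])
```

With a threshold of 0.0 both 'good' and 'fine' count as positive, with 0.1 only 'good' does, and with 0.8 nothing is positive anymore.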
 
 
'''if you get a 0.0 value for “happy” something is wrong'''
 
[[File:If-happy-is-0.0-something-is-wrong.png|border]]
 
<small>answer by Tom de Smedt on CLiPS's Google Groups about the sentiment_score() function, source: https://groups.google.com/forum/#!topic/pattern-for-python/FTeqb0p5eFM </small>


<small>Blue Notebook #10 / The Red-haired Man, by Daniil Ivanovich Kharms, (Translated by Neil Cornwall), written on the 7th of January, 1931; via Matthew Fuller during Ideographies of Knowledge, Mundaneum, 03-10-2015, Mons</small>
</div>


==a non-man==
The poem above describes a man, who had no mouth, no nose, nor arms or legs. He had no back, no spine and no innards whatsoever. He actually had nothing at all. All the things that we know of that are needed to recognize a man as such, are absent. The little two-letter word 'no' denies all characteristics that a normal man would have. And before the short story can continue to speak about the man in question, he has disappeared.


<div style="color:gray;">
The poem was written by the poet Daniil Kharms in the Soviet Union in early 1931. Kharms wrote this short story in a blue notebook, and placed it as the tenth piece of writing among other short stories covering the contrast of being rich/poor or smart/stupid. In line with these identity explorations, the red-haired-man echoes a quest for manliness, or even in a more general sense: a quest for being.
other 'flags':


'''averaging polarity'''
The specific descriptions of the red-haired-man introduce him as a figure who starts to appear but disappears at the very same moment. By removing the man's qualities while introducing him, he becomes a paradox: a non-man. While the non-man is a type of man that contradicts itself, it is impossible to say that this man does not exist at all.


<word form="amazing" wordnet_id="a-01282510" pos="JJ" sense="inspiring awe or admiration or wonder" polarity="0.8" subjectivity="1.0" intensity="1.0" confidence="0.9" />


<word form="amazing" wordnet_id="a-02359789" pos="JJ" sense="surprising greatly" polarity="0.4" subjectivity="0.8" intensity="1.0" confidence="0.9" />
The red-haired-man is stripped of all his characteristics and presented to the reader in a state of full bareness, not in the literal sense of a naked man, but rather in the way that we can ask ourselves if the man can even still exist as a fictional character. If all the man's attributes are erased, what is then still left of him that we can speak about?


<small>items of an annotated adjectives wordlist, source: pattern-2.6/pattern/text/en/en.sentiment.xml</small>
==text as data==
This thought-experiment makes it possible to shine a light on a material that is subjected to a similar way of being stripped of its characteristics: text that is used as data.
Ever since the very early emergence of the computer [REF], Turing's article on the Turing test in 1950 [REF] and Weizenbaum's psychotherapist chat-bot ELIZA around 1965 [REF], computer scientists have worked on applications that turn natural language into a format that can be processed by a computer. Linguists also became interested in approaching language from a computational perspective, which formed the field of computational linguistics.
In the last few years, roughly since 2012 [REF], the field of natural language processing (NLP) has merged with a particular hype around the possession of data, a specific culture with high aims and a strong belief in statistical computation, which has been labeled with buzzwords like 'big data', 'raw data', 'text mining', 'machine learning' and more recently with the even more mystifying term 'deep learning'. These techniques attempt to find patterns in a set of data by measuring and calculating similarities between different sets of data. And since social media messages, blog posts, news articles or emails are regarded as resources for useful and valuable information, data analysts aim to measure written text as well. By looking at specific word occurrences and grammatical structures, the data analyst attempts to measure a text's characteristics, like sentiment, violence, certainty, vagueness, subjectivity, factuality, depression, or degrees of irony and sarcasm. But to be able to measure these qualities in a text, the paragraphs, sentences and words need to be transformed into data, which basically means: they need to turn into numbers.
Where and how text is transformed into data is not something that is standardized or fixed. There are many ways to process a document of written text into something that can be measured and calculated.  




>>> from pattern.en import sentiment
>>> print "amazing", sentiment("amazing")
==text processing==
* split (tokenize)
A common way to make a text processable by the computer is to split it up into smaller parts. The split function is a basic function that is included in, for example, the programming language Python. It takes a sentence or text as input (in string format), splits the text on whitespace by default, and returns a list. More comprehensive software contains a stronger 'tokenizer' variant, which is written to also detect punctuation, abbreviations and short linguistic elements such as 's or 'd in English. After the split function, the text is no longer a continuous waterfall of words in sentences in paragraphs, but a list of words. This list is a format that is in line with the nature of a computer, which can now, with the help of e.g. Python or LibreOffice, sort the list in, for example, alphabetical order.
The split function makes it possible to process a text in a very basic but powerful way. The text transforms into data: a list of words.


amazing (0.6000000000000001, 0.9)


<small>example script, source: pattern-2.6/examples/03-en/07-sentiment.py</small>
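
The tuple printed for 'amazing' can be retraced by hand from the two annotated senses in the wordlist: polarity (0.8 + 0.4) / 2 = 0.6 and subjectivity (1.0 + 0.8) / 2 = 0.9. The sketch below, in plain Python 3, repeats that calculation. It assumes that the scores of a word with several senses are simply averaged, which is consistent with the printed result but is not verified here against Pattern's source code.

```python
# Averaging the annotated scores of the two senses of "amazing".
# Assumption: a word's score is the plain mean over its senses.
senses = [
    {"sense": "inspiring awe or admiration or wonder",
     "polarity": 0.8, "subjectivity": 1.0},
    {"sense": "surprising greatly",
     "polarity": 0.4, "subjectivity": 0.8},
]


def average(values):
    """Arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


polarity = average(s["polarity"] for s in senses)
subjectivity = average(s["subjectivity"] for s in senses)
print((polarity, subjectivity))
```

The averaged score belongs to neither of the two senses: a text that uses 'amazing' in the second sense is measured as more positive than its own annotation says.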
* count (bag-of-words)
Now that the text has been split, it has changed into words and other chunks of characters. One technique that transforms them into numbers is counting the chunks that have exactly the same form. In this way, the form of the chunks is measured and expressed in a number. A technique that is often used in the field of text mining is a more extensive version of the count function, known as bag-of-words. The bag-of-words function makes it possible to provide a list of words that should be excluded from the counting, such as stopwords or other frequently used phrases. Next to that, the bag-of-words function can count relatively: by taking all words that appear in a set of documents into account, it measures the relative uniqueness of a specific word. This technique is often used to create a numerical representation of a text, which can be compared to other texts.
After applying one of these counting techniques, the text has again transformed into another type of data: a list of word-number combinations.




'''annotating subjectivity'''
* tag (part-of-speech, POS)
from sentence to word-types and syntactical structures & disambiguation
(more to come here)


<word form="haha" wordnet_id="" pos="UH" polarity="0.2" subjectivity="0.3" intensity="1.0" confidence="0.9" />
==a non-text?==
An obvious question here is when a text can still be called a text. What similarities does the data-version of a text still hold with the written document? And in what respects do they now differ, after the transformation into a data-object that is created for functional and useful reasons? Or is a process at work similar to the one Kharms described with the red-haired man? Does text become a non-text as well?


<word form="hahaha" wordnet_id="" pos="UH" polarity="0.2" subjectivity="0.4" intensity="1.0" confidence="0.9" />


<word form="hahahaha" wordnet_id="" pos="UH" polarity="0.2" subjectivity="0.5" intensity="1.0" confidence="0.9" />
<div style="width:350px;">
''There was a text which had no layout or sentences. Neither did it have any spaces, so it was called a text theoretically.''


<word form="hahahahaha" wordnet_id="" pos="UH" polarity="0.2" subjectivity="0.6" intensity="1.0" confidence="0.9" />
''It couldn't communicate anything, since it didn't have a message. Neither did it have an intention. It didn't even have any grammar. It had no subjects and it had no objects and it had no punctuation and it had no previous paragraph whatsoever. It had nothing at all!''


<small>items of an annotated adjectives wordlist, source: pattern-2.6/pattern/text/en/en.sentiment.xml</small>
''Therefore there's no knowing what we are even talking about. In fact it's better that we don't say any more about it.''
</div>
</div>


==1.3 – text mining applications==
==the non-text paradox, no context==
A stripped, ordered, split and tagged text loses many of its characteristics and qualities. Where the man has no ears, eyes, arms, legs, hair, back or spine, a parsed text has no sentences, no grammar, no spaces, no punctuation, no page numbers, no reading time, no previous paragraph, no layout; it even has no initial message anymore and no author's intention. There is no subtle tone, no ironic character, no flow of certainty, no rhetorical style, no ambiguous connotation from beautiful rhymes and no shouting voice<sup>1</sup>. Because the chunks of characters are material objects, which are measurable and calculable, they are placed on the highest stage and depicted as the barest primary ingredients, the most basic and pure elements of a text.


=== applications of text mining ===
==raw as static status==
[[User:Manetta/i-could-have-written-that/kdd-applications |of Pattern, Weka, and the World Well Being Project &rarr; listed here]]
These basic elements are often considered to be raw. The idea of raw data is often used in the field of text mining and data analysis to highlight its fundamental function and (too) powerful position. The data is assumed to be raw in the sense that minerals are raw, or vegetables are raw. But a closer look at this rawness shows that it excludes the important part of how they actually became raw initially. Minerals are considered to be raw when they are still under the ground, ready to be mined and extracted from the earth, but they have developed over many years under the influence of multiple geological processes. Vegetables are considered to be raw in a similar way. When they lie in the supermarket they are raw in the sense of being uncooked, but they have developed from a little seed over a long period of time while absorbing a lot of sunlight, water and organic substances from the ground. The metaphor of rawness is often used to refer to data, but it is built on the ideal that something that is raw is simply already there. Raw data carries the illusion that the data is true in itself, and cannot be argued with just because it is there.


Showing that text mining has been applied across very different fields, and thereby seems to be a sort of 'holy grail', solving a lot of problems.
==ideal of rawness==
Next to disguising the fact that raw data is created within a specific preceding process, rawness also contains a promise of direct and unmediated access. Antoinette Rouvroy (2015) points towards the side-effects that occur when data is presented as a raw object. She describes how raw data implies a reference to data as a natural product: “It is the idea that nature will speak by itself. It is the idea that thanks to big data, the world speaks by itself without any: transcription, symbolization, institutional mediation, political mediation or legal mediation”. If data is considered to be raw, it implies that the data can be extracted from its source without any layer of mediation. The directness that rawness connotes constructs (again) an objective and truthful position for the data. It facilitates a strong belief that results only need to be extracted from their source, which can be done without any interference from anything or anyone.


(i'm not sure if this is needed)
==conclusion==
Where in the poem of Daniil Kharms the red-haired-man appeared and disappeared at the same time, he became a paradox: a non-man. The poem discovers the limitations of language by introducing a character that is immediately claimed not to exist (Yankelevich, 2007). The poem becomes an exercise in which the reader is challenged to decide if the man actually has disappeared. But while a written text slowly transforms into a non-text, by having no reading order, layout, tone and other contextual elements, the text does not disappear. A non-text is no paradox, in the sense that it does not contradict itself in the way that Daniil Kharms turned the red-haired-man into a fictional entity without any characteristics. Although a text loses many of its characteristics after splitting, counting or tagging, it does not disappear. Instead, while the text becomes a non-text and slowly transforms into data, it is slowly encapsulated by the ideology of rawness<sup>2</sup>.




[[User:Manetta/thesis/chapter-1 | chapter 1]]


[[User:Manetta/thesis/chapter-2 | chapter 2]]


[[User:Manetta/thesis/chapter-3 | chapter 3]]
</div>
</div>

Latest revision as of 15:11, 30 April 2016

chapter 1 - raw language

There lived a red-haired-man who had no eyes or ears. Neither did he have any hair, so he was called red-haired theoretically.

He couldn't speak, since he didn't have a mouth. Neither did he have a nose. He didn't even have any arms or legs. He had no stomach and he had no back and he had no spine and he had no innards whatsoever. He had nothing at all!

Therefore there's no knowing whom we are even talking about. In fact it's better that we don't say any more about him.

Blue Notebook #10 / The Red-haired Man, by Daniil Ivanovich Kharms, (Translated by Neil Cornwall), written on the 7th of January, 1931; via Matthew Fuller during Ideographies of Knowledge, Mundaneum, 03-10-2015, Mons

a non-man

The poem above describes a man, who had no mouth, no nose, nor arms or legs. He had no back, no spine and no innards whatsoever. He actually had nothing at all. All the things that we know of that are needed to recognize a man as such, are absent. The little two-letter word 'no' denies all characteristics that a normal man would have. And before the short story can continue to speak about the man in question, he has disappeared.

The poem was written by the poet Daniil Kharms in the Soviet Union in early 1931. Kharms wrote this short story in a blue notebook, and placed it as the tenth piece of writing among other short stories covering the contrast of being rich/poor or smart/stupid. In line with these identity explorations, the red-haired-man echoes a quest for manliness, or even in a more general sense: a quest for being.

The specific descriptions of the red-haired-man introduce him as a figure who starts to appear but disappears at the very same moment. By removing the man's qualities while introducing him, he becomes a paradox: a non-man. While the non-man is a type of man that contradicts itself, it is impossible to say that this man does not exist at all.


The red-haired-man is stripped of all his characteristics and presented to the reader in a state of full bareness, not in the literal sense of a naked man, but rather in the way that we can ask ourselves if the man can even still exist as a fictional character. If all the man's attributes are erased, what is then still left of him that we can speak about?

text as data

This thought-experiment makes it possible to shine a light on a material that is subjected to a similar way of being stripped of its characteristics: text that is used as data.

Ever since the very early emergence of the computer [REF], Turing's article on the Turing test in 1950 [REF] and Weizenbaum's psychotherapist chat-bot ELIZA around 1965 [REF], computer scientists have worked on applications that turn natural language into a format that can be processed by a computer. Linguists also became interested in approaching language from a computational perspective, which formed the field of computational linguistics.

In the last few years, roughly since 2012 [REF], the field of natural language processing (NLP) has merged with a particular hype around the possession of data, a specific culture with high aims and a strong belief in statistical computation, which has been labeled with buzzwords like 'big data', 'raw data', 'text mining', 'machine learning' and more recently with the even more mystifying term 'deep learning'. These techniques attempt to find patterns in a set of data by measuring and calculating similarities between different sets of data. And since social media messages, blog posts, news articles or emails are regarded as resources for useful and valuable information, data analysts aim to measure written text as well. By looking at specific word occurrences and grammatical structures, the data analyst attempts to measure a text's characteristics, like sentiment, violence, certainty, vagueness, subjectivity, factuality, depression, or degrees of irony and sarcasm. But to be able to measure these qualities in a text, the paragraphs, sentences and words need to be transformed into data, which basically means: they need to turn into numbers.

Where and how text is transformed into data is not something that is standardized or fixed. There are many ways to process a document of written text into something that can be measured and calculated.


text processing

  • split (tokenize)

A common way to make a text processable by the computer is to split it up into smaller parts. The split function is a basic function that is included in, for example, the programming language Python. It takes a sentence or text as input (in string format), splits the text on whitespace by default, and returns a list. More comprehensive software contains a stronger 'tokenizer' variant, which is written to also detect punctuation, abbreviations and short linguistic elements such as 's or 'd in English. After the split function, the text is no longer a continuous waterfall of words in sentences in paragraphs, but a list of words. This list is a format that is in line with the nature of a computer, which can now, with the help of e.g. Python or LibreOffice, sort the list in, for example, alphabetical order.

The split function makes it possible to process a text in a very basic but powerful way. The text transforms into data: a list of words.
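
The difference between the basic split function and a stronger tokenizer can be sketched in a few lines of Python 3. The example sentence is invented, and the regular expression is a small illustration written for this text, not the tokenizer of any particular software package.

```python
# Splitting versus tokenizing. The sentence and the regular expression
# are illustrative assumptions, not taken from an existing tokenizer.
import re

sentence = "John's show is bad, he'd say."

# str.split() without arguments splits on runs of whitespace.
words = sentence.split()

# A slightly stronger variant: words, clitics such as 's and 'd,
# and single punctuation marks each become their own token.
TOKEN = re.compile(r"\w+|'\w+|[^\w\s]")
tokens = TOKEN.findall(sentence)

print(words)
print(tokens)
# Once the text is a list, it can be sorted like any other list.
print(sorted(tokens))
```

The plain split keeps "bad," and "he'd" as single chunks, while the tokenizer returns 'bad', ',', 'he' and "'d" as separate items. Which of the two lists counts as 'the words of the text' is a decision of the software.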


  • count (bag-of-words)

Now the text has been split, the text is changed into words and other chunks of characters. A technique that transforms them into numbers is by counting the chunks that have exactly the same form. In this way, the form of the chunks is measured and expressed in a number. A technique that is often used in the field of text mining is a more extensive version of the count function, and known as bag-of-words. The bag-of-words function enables to provide a word list of words that should be excluded from the counting, such as stopwords or other often used phrases. Next to that, the bag-of-words function can count relatively, by taking all words that appear in a set of documents into account, and count the relative uniqueness of a specific word. This technique is often used to create a numerical representation of a text, which can be compared to other texts.

After applying one of these counting techniques, the text has again been transformed into another type of data: a list of word-number combinations.
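The counting step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of a specific text mining tool: the stopword list is a tiny made-up example, the function names are hypothetical, and the relative count uses one common weighting (the count of a word multiplied by the logarithm of the inverse share of documents that contain it), which the text above does not name explicitly.

```python
import math
from collections import Counter

# A tiny, made-up stopword list (an assumption for illustration only).
STOPWORDS = {"a", "and", "the", "or"}


def bag_of_words(text, stopwords=STOPWORDS):
    """Count identical chunks after splitting, skipping the stopwords."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in stopwords)


def relative_weights(documents):
    """Weigh each count by how unique the word is across all documents.

    A word that appears in every document gets the weight 0.
    """
    bags = [bag_of_words(d) for d in documents]
    n_docs = len(bags)
    # In how many documents does each word appear at least once?
    doc_freq = Counter(word for bag in bags for word in bag)
    return [
        {w: count * math.log(n_docs / doc_freq[w]) for w, count in bag.items()}
        for bag in bags
    ]


counts = bag_of_words("The red man and the red hair.")
weights = relative_weights(["red man", "red hair"])
```

Here `counts` holds the word-number combinations for a single text, while `weights` holds one dictionary per document: `red` appears in both documents and therefore weighs nothing, whereas `man` and `hair` each distinguish one document from the other.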


  • tag (part-of-speech, POS)

from sentence to word-types and syntactical structures & disambiguation (more to come here)

a non-text?

An obvious question here is at what point a text can still be called a text. What similarities does the data version of a text still hold with the written document? And on what points do they differ after the transformation into a data object that is created for functional and useful reasons? Or is a process active that is similar to the one Kharms described with the red-haired man? Does text become a non-text as well?


There was a text which had no layout or sentences. Neither did it have any spaces, so it was called a text theoretically.

It couldn't communicate anything, since it didn't have a message. Neither did it have an intention. It didn't even have any grammar. It had no subjects and it had no objects and it had no punctuation and it had no previous paragraph whatsoever. It had nothing at all!

Therefore there's no knowing what we are even talking about. In fact it's better that we don't say any more about it.

the non-text paradox, no context

A stripped, ordered, split and tagged text loses many of its characteristics and qualities. Where the man has no ears, eyes, arms, legs, hair, back or spine, a parsed text has no sentences, no grammar, no spaces, no punctuation, no page numbers, no reading time, no previous paragraph and no layout; it no longer even has an initial message or an author's intention. There is no subtle tone, no ironic character, no flow of certainty, no rhetoric style, no ambiguous connotation from beautiful rhymes and no shouting voice[1]. Because the chunks of characters are material objects, which are measurable and calculable, they are placed on the highest stage and depicted as the most bare primary ingredients, the most basic and pure elements of a text.

raw as static status

These basic elements are often considered to be raw. The idea of raw data is often used in the field of text mining and data analysis to highlight its fundamental function and (too) powerful position. The elements are assumed to be raw in the sense that minerals are raw, or vegetables are raw. But a closer look at their rawness shows that it leaves out an important part: how they actually became raw in the first place. Minerals are considered to be raw when they are still under the ground, ready to be mined and extracted from the earth, but they have developed over many years under the influence of multiple geological processes. Vegetables are considered to be raw in a similar way. When they lie in the supermarket they are raw in the sense of being uncooked, but they have developed from a little seed over a long period of time while absorbing a lot of sunlight, water and organic substances from the ground. The metaphor of rawness is often used to refer to data, but it is built on the ideal that something raw is simply already there. Raw data carries the illusion that the data is true in itself, and cannot be argued with just because it is there.

ideal of rawness

Next to disguising the fact that raw data is created within a specific preceding process, rawness also contains a promise of direct and unmediated access. Antoinette Rouvroy (2015) points to the side effects that occur when data is presented as a raw object. She describes how raw data implies a reference to data as a natural product: “It is the idea that nature will speak by itself. It is the idea that thanks to big data, the world speaks by itself without any: transcription, symbolization, institutional mediation, political mediation or legal mediation”. If data is considered to be raw, it implies that the data can be extracted from its source without any layer of mediation. The directness that rawness connotes constructs (again) an objective and truthful position for the data. It facilitates a strong belief that results only need to be extracted from their source, which can be done without interference from anything or anyone.

conclusion

In the poem by Daniil Kharms, the red-haired-man appears and disappears at the same time, and so becomes a paradox: a non-man. The poem discovers the limitations of language by introducing a character that is immediately claimed not to exist (Yankelevich, 2007). The poem becomes an exercise in which the reader is challenged to decide whether the man has actually disappeared. But while a written text slowly transforms into a non-text, by having no reading order, layout, tone and other contextual elements, the text does not disappear. A non-text is no paradox in the sense that it does not contradict itself in the way that Daniil Kharms turned the red-haired-man into a fictional entity without any characteristics. Although a text loses many of its characteristics after splitting, counting or tagging, it does not disappear. Instead, while the text becomes a non-text and slowly transforms into data, it is slowly encapsulated by the ideology of rawness[2].

