chapter 1 - raw language

There lived a red-haired-man who had no eyes or ears. Neither did he have any hair, so he was called red-haired theoretically.

He couldn't speak, since he didn't have a mouth. Neither did he have a nose. He didn't even have any arms or legs. He had no stomach and he had no back and he had no spine and he had no innards whatsoever. He had nothing at all!

Therefore there's no knowing whom we are even talking about. In fact it's better that we don't say any more about him.

Blue Notebook #10 / The Red-haired Man, by Daniil Ivanovich Kharms, (Translated by Neil Cornwall), written on the 7th of January, 1931; via Matthew Fuller during Ideographies of Knowledge, Mundaneum, 03-10-2015, Mons

a non-man

The poem above describes a man who had no mouth, no nose, and no arms or legs. He had no back, no spine and no innards whatsoever. He actually had nothing at all. All the things that we know are needed to recognize a man as such are absent. The little two-letter word 'no' denies all the characteristics that a normal man would have. And before the short story can continue to speak about the man in question, he has disappeared.

The poem was written by the poet Daniil Kharms in the Soviet Union in early 1931. Daniil Kharms wrote this short story in a blue notebook, and he placed it as the tenth piece of writing in between other short stories covering the contrast of being rich/poor or smart/stupid. In line with these identity explorations, the red-haired-man echoes a quest for manliness, or even in a more general sense: a quest for being.

The specific descriptions of the red-haired-man introduce him as a figure that starts to appear but disappears at the very same moment. Because the man's qualities are removed while he is being introduced, he becomes a paradox: a non-man. While the non-man is a type of man that contradicts itself, it is impossible to say that this man does not exist at all.

The red-haired-man is stripped of all his characteristics and presented to the reader in a state of full bareness, not in the literal sense of a naked man, but rather in the sense that we can ask ourselves whether the man can even still exist as a fictional character. If all the man's attributes are erased, what is then still left of him that we can speak about?

text as data

This thought-experiment makes it possible to shine a light on a material that is subjected to a similar way of being stripped of its characteristics: text that is used as data.

Ever since the very early emergence of the computer [REF], Turing's article on the Turing test in 1950 [REF] and Weizenbaum's psychotherapist chatbot ELIZA around 1965 [REF], computer scientists have worked on applications that turn natural language into a format that a computer can process. Linguists also became interested in approaching language from a computational perspective, which formed the field of computational linguistics.

In the last few years, roughly since 2012 [REF], the field of natural language processing (NLP) has merged with a particular hype around the possession of data, a specific culture with high aims and a strong belief in statistical computation, which has been labeled with buzzwords like 'big data', 'raw data', 'text mining', 'machine learning' and more recently with the even more mystifying term 'deep learning'. These techniques attempt to find patterns in a set of data by measuring and calculating similarities between different sets of data. And since social media messages, blog posts, news articles and emails are regarded as resources for useful and valuable information, data analysts have aimed to measure written text as well. By looking at specific word occurrences and grammatical structures, the data analyst attempts to measure a text's characteristics, like sentiment, violence, certainty, vagueness, subjectivity, factuality, depression, or degrees of irony and sarcasm. But to be able to measure these qualities in a text, the paragraphs, sentences and words need to be transformed into data, which basically means: they need to turn into numbers.

Where and how text is transformed into data is not something that is standardized or fixed. There are many ways to process a document of written text into something that can be measured and calculated.

text processing

split (tokenize)

A common way to make a text processable by the computer is to split it up into smaller parts. The split function is a basic function that is included in, for example, the programming language Python. It takes a sentence or text as input (in string format), splits the text on whitespace by default, and returns a list. More comprehensive software contains a stronger 'tokenizer' variant, which is written to also detect punctuation, abbreviations and short linguistic elements such as 's or 'd in English. After the split function, the text is no longer a continuous waterfall of words in sentences in paragraphs, but a list of words. This list is a format that is in line with the nature of a computer, which can now (with the help of, for example, Python or LibreOffice) sort the list in, for example, alphabetical order.

The split function makes it possible to process a text in a very basic but powerful way. It transforms the text into data: a list of words.
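The difference between a plain split and a tokenizer can be sketched in a few lines of Python. This is only a minimal illustration: the names TOKEN_PATTERN and tokenize are hypothetical, and the regular expression is a rough assumption of what a tokenizer does, not the behaviour of any particular text mining package.

```python
import re

# Rough tokenizer pattern (an illustrative assumption, not a standard one):
# a word of ASCII letters (optionally hyphenated), a clitic such as 's or 'd,
# or a single punctuation mark. Digits are simply ignored by this sketch.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*|'[A-Za-z]+|[^\w\s]")


def tokenize(text):
    """Return words, clitics and punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "He had nothing at all!"

# str.split() without arguments splits on any run of whitespace,
# so the exclamation mark stays glued to the last word.
words = sentence.split()

# The tokenizer separates the punctuation mark from the word.
tokens = tokenize(sentence)

# Once the text is a list, it can be reordered, for example alphabetically
# (case-insensitive here, so that "He" is not sorted before all lowercase words).
sorted_words = sorted(words, key=str.lower)

print(words)
print(tokens)
print(sorted_words)
```

Note that the sorted list no longer carries the reading order of the sentence: the same five chunks are present, but the sentence itself cannot be recovered from it.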

count (bag-of-words)

Now that the text has been split, it has changed into words and other chunks of characters. One technique that transforms them into numbers is counting the chunks that have exactly the same form. In this way, the form of the chunks is measured and expressed in a number. A technique that is often used in the field of text mining is a more extensive version of the count function, known as bag-of-words. The bag-of-words function makes it possible to provide a list of words that should be excluded from the counting, such as stopwords or other frequently used phrases. Next to that, the bag-of-words function can count relatively: it takes all the words that appear in a set of documents into account and calculates the relative uniqueness of a specific word. This technique is often used to create a numerical representation of a text, which can then be compared to other texts.

After applying one of these counting techniques, the text has again transformed into another type of data: a list of word-number combinations.
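Both ways of counting can be sketched as follows. The function names bag_of_words and relative_weights are hypothetical. For the relative count, the sketch assumes a TF-IDF weighting (term frequency multiplied by the logarithm of the inverse document frequency), which is one common way to express the relative uniqueness of a word. Text mining packages may use other variants.

```python
import math
from collections import Counter


def bag_of_words(tokens, stopwords=frozenset()):
    """Count identical (lowercased) tokens, skipping the given stopwords."""
    lowered = (token.lower() for token in tokens)
    return Counter(token for token in lowered if token not in stopwords)


def relative_weights(documents):
    """Weight each word per document with TF-IDF (an assumed weighting).

    documents is a list of token lists. A word that occurs in every
    document gets weight 0, a word unique to one document gets more.
    """
    bags = [bag_of_words(tokens) for tokens in documents]
    n_docs = len(bags)

    # Number of documents in which each word occurs at least once.
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())

    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return weights


tokens = "he had no back and he had no spine".split()
counts = bag_of_words(tokens, stopwords={"and"})
print(counts)

documents = ["he had no eyes".split(), "he had no ears".split()]
for doc_weights in relative_weights(documents):
    print(doc_weights)
```

In the second example the words 'he', 'had' and 'no' appear in both documents and therefore receive a weight of zero, while 'eyes' and 'ears' are the only words that distinguish the two texts numerically.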

tag (part-of-speech, POS)

from sentence to word-types and syntactical structures & disambiguation (more to come here)

a non-text?

An obvious question here is when a text can still be called a text. What similarities does the data version of a text still hold with the written document? And on what points do they differ after the transformation into a data object that is created for functional and useful reasons? Or is a process at work that is similar to the one Kharms described with the red-haired man? Does text become a non-text as well?

There was a text which had no layout or sentences. Neither did it have any spaces, so it was called a text theoretically.

It couldn't communicate anything, since it didn't have a message. Neither did it have an intention. It didn't even have any grammar. It had no subjects and it had no objects and it had no punctuation and it had no previous paragraph whatsoever. It had nothing at all!

Therefore there's no knowing what we are even talking about. In fact it's better that we don't say any more about it.

the non-text paradox, no context

A stripped, ordered, split and tagged text loses many of its characteristics and qualities. Where the man has no ears, eyes, arms, legs, hair, back or spine, a parsed text has no sentences, no grammar, no spaces, no punctuation, no page numbers, no reading time, no previous paragraph, no layout; it even has no initial message anymore and no author's intention. There is no subtle tone, no ironic character, no flow of certainty, no rhetorical style, no ambiguous connotation from beautiful rhymes and no shouting voice [1]. Because the chunks of characters are material objects, which are measurable and calculable, they are placed on the highest stage and depicted as the most bare primary ingredients, the most basic and pure elements of a text.

raw as static status

These basic elements are often considered to be raw. The idea of raw data is often used in the field of text mining and data analysis to highlight its fundamental function and (too) powerful position. The elements are assumed to be raw in the sense that minerals are raw, or vegetables are raw. But when you take a closer look at their rawness, it excludes the important part of how they actually became raw initially. Minerals are considered to be raw when they are still under the ground, ready to be mined and extracted from the earth, but they have formed over many years under the influence of multiple geological processes. Vegetables are considered to be raw in a similar way. When they lie in the supermarket they are raw in the sense of being uncooked, but they have developed from a little seed over a long period of time while absorbing a lot of sunlight, water and organic substances from the ground. The metaphor of rawness is often used to refer to data, but it is built on the ideal that something that is raw is simply already there. Raw data implies the illusion that the data is true in itself, and cannot be argued with just because it is there.

ideal of rawness

Next to disguising the fact that raw data is created within a specific preceding process, rawness also contains a promise of direct and unmediated access. Antoinette Rouvroy (2015) points towards the side effects that occur when data is presented as a raw object. She describes how raw data implies a reference to data as a natural product: “It is the idea that nature will speak by itself. It is the idea that thanks to big data, the world speaks by itself without any: transcription, symbolization, institutional mediation, political mediation or legal mediation”. If data is considered to be raw, it implies that the data can be extracted from its source without any layer of mediation. The directness that rawness connotes constructs (again) an objective and truthful position for the data. It facilitates a strong belief that results only need to be extracted from their source, which can be done without any interference from anything or anyone.

conclusion

In the poem of Daniil Kharms the red-haired-man appeared and disappeared at the same time, and so he became a paradox: a non-man. The poem discovers the limitations of language by introducing a character that is immediately claimed not to exist (Yankelevich, 2007). The poem becomes an exercise in which the reader is challenged to decide whether the man has actually disappeared. But while a written text slowly transforms into a non-text, by having no reading order, layout, tone and other contextual elements, the text does not disappear. A non-text is no paradox, in the sense that it does not contradict itself in the way that Daniil Kharms turned the red-haired-man into a fictional entity without any characteristics. Although a text loses many of its characteristics after splitting, counting or tagging, it does not disappear. Instead, while the text becomes a non-text and slowly transforms into data, it is slowly encapsulated in the ideology of rawness [2].