User:Manetta/thesis/chapter-1
=i could have written that - chapter 1=


==1.1 – text mining culture: on what basis? three settings to highlight differences in text analytical ideologies==
Text mining involves NLP techniques in order to perform as an effective analytical system. By processing written natural language, the technology aims to derive information from large sets of written text.


Text mining is a politically sensitive technology, closely related to surveillance and privacy discussions around 'big data'. The technique sits in the middle of tense discussions about capturing people's behavior for security reasons, discussions that affect the privacy of many people, accompanied by an unpleasant controlling force that seems to be omnipresent. After Edward Snowden's disclosures of the NSA's data capturing program in 2013, a wider public became aware of the silent data collecting activities of a governmental agency, for example on phone metadata. Since October 2014, UK legislation has made special exceptions in its copyright law to allow text mining of intellectual property for non-commercial use. Problematic is the skewed balance between data producers and data analysts, also framed as 'data colonialism', and the accompanying governmental role this gives to data analytics, for example by constructing your list of search results according to your data profile.
===Setting 1===
[[File:EUR-PhD-defence-sentiment-mining.JPG|thumb|left]]
It is a wet Friday afternoon in mid November 2015. A woman dressed in traditional academic garments enters the lecture room. The people in the audience stand up from their seats. The woman carries a stick with bells; they tinkle softly. It seems to be her way of telling everyone to be silent and stay focused on what is coming. A group of two women and seven men dressed in togas follows her. They walk to their seats behind the jury table. The doctoral candidate of the Economics department starts his defense. He introduces his area of research with a short description of the increasing amount of information that is published on the internet these days in the form of text. Text that could be used to extract information about the reputation of a company. It is important for decision makers to know how the public feels about their products and services. Online written material, such as reviews, is a very useful source for companies to extract that information from. How could this be done? The candidate illustrates his methodology with an image of a branch with multiple leaves. When looking at the leaves, one could order them by color or shape. Such ordering techniques can be applied to written language as well: by analyzing and sorting words. The candidate's topic of research has been 'sentiment analysis in written texts'. This is nothing new; sentiment analysis is a common tool in the text mining field. The candidate's aim is to improve this technique: he proposes to detect emoticons as sentiment, and to add more weight to more important segments of a sentence.


questions
One of the professors opens the discussion in a critical tone. He asks the candidate for his definition of the word 'sentiment'. The candidate replies that sentiment is what people intend to convey. There is the sentiment that the reader perceives, and there is the sentiment that the writer conveys. In the case of reviews, sentiment is a judgment. The professor states that the candidate only used the values '-1' and '+1' to describe sentiment in his thesis, which is not a definition. The professor continues by asking if the candidate could offer a theory on which the thesis is based. But again, no answer fulfills the professor's request. The professor claims that the candidate's thesis only presents numbers, no definitions.
* If text mining is regarded as a writing system, what and where does it write?
** What are the levels of construction in text mining software culture?
*** What follows from considering text mining technology as a reading machine?
*** How does the metaphor of 'mining' affect the process?
*** How much can be based on a cultural side-product (like the text that is commonly used, as it is extracted from e.g. social media)?


Another professor continues and asks about the 'neutral backbone' used in the research to validate the sentiment of certain words. Did the candidate collaborate with psychologists, for example? The candidate replies that he collaborated with a company that manually rated the sentiment values of words and sentences. He cannot describe how that annotation process was executed. The professor highlights the importance of an external backbone that is needed in order to be able to give results, which brings him to his next question: the professor counted 6,000 calculations that had been done to confirm the candidate's hypothesis. This 'hypothesis testing' is a recurring element in the arguments of the thesis. The candidate is asked whether he wasn't over-enthusiastic about his results.


=== text mining as reading machine ===
But the jury must also admit that it is quite an achievement to write a doctoral thesis on the topic of text mining at a university that has no department of linguistics, nor one of computer science. The candidate was located at the Economics department, under the 'Erasmus Research Institute of Management'. Still, when the candidate was asked about his plans to fix the gaps in his thesis, he replied that he already had a job in the business, and that rewriting his thesis would be neither a priority nor his primary interest.
The magical effect of text mining results, caused by the hidden presence of data analytics and the multi-layered complexity of text mining software, makes it difficult to formulate an opinion about text mining techniques. As text mining is an analytical process, the technology is often understood as a 'reading' machine.


* 'reading' connotations?


(A short note here on the use of written text as source material. As Vilém Flusser discussed in his 1983 essay 'Towards a Philosophy of Photography': “images are not 'denotative' (unambiguous), but 'connotative' (ambiguous) → complexes of symbols, providing space for connotation”. There is much more to say here, but in terms of a short note: text (and data) is likewise not 'denotative' (unambiguous) but 'connotative' (ambiguous), full of complexes of symbols, providing space for connotation.)
===Setting 2===
2:11 Now let's start with something that's relatively clear,
2:14 and let's see if it makes sense.
2:15 See the words that are most typical, most discriminative, most predictive,
2:21 of being female.
2:23 (Laughter)
2:30 Yeah, it's a little bit embarrassing, I'm sorry, but I didn't make this up!
2:34 It's very cliché, but these are the words.


* 'machine' connotations?
[[File:TED-talk-screenshot_Lyle-Unger-World-Well-Being-Project_predicting-heartdiseases-using-Twitter.png|thumb|left]]
..., Weizenbaum, about 'machine'


If text mining software is regarded as a reading system, it becomes even more difficult to formulate what the problem exactly is. Many people tend to agree with the calculations and word-counts that come out of the software. "What exactly is the problem?" and "This is the data that speaks, right?" are questions that need to be challenged in order to have a conversation about text mining techniques at all. These are examples of what I would like to call 'algorithmic agreeability'.
The video reaches 2:11 when Lyle Ungar starts his introduction to the first text mining results that he will present to his TED audience tonight. Luckily he can start with something that is relatively clear: the words that are most typical, most discriminative, most predictive, of being female. Nothing too complicated to start with. He proposes to team up with his audience, to see together if the outcomes make sense. Only ten seconds later Lyle hits the button on the TED slide remote control. A bright glare appears in the reflection of his glasses. The audience slowly starts to titter. Lyle looks up to face his audience. He frowns, turns his head, walks a few steps to the right and sighs theatrically, somewhat too loudly. While Lyle's posture speaks the language of shameful soreness, a white slide with colorful words appears on the screen. '<3' is typeset in the largest font size, followed by 'excited', 'shopping', 'love you' and 'my hair', surrounded by some fifty other words that together form the shape of a cloud. '(Laughter)' appears in the subtitles. The audience seems to recognize the words, and responds to them with a stifled laughter. Is it the term 'shopping', appearing so big, that is funny? Because it confirms a stereotype? Or is it surprising to see which extreme expressions appear to be typical of being female? Lyle had seen it coming, and quickly excuses himself for the results by saying: "I didn't make this up! It's very cliché, but these are the words."


These results are part of a text mining research project of the University of Pennsylvania called the 'World Well Being Project' (WWBP). The project is located at the 'Positive Psychology Center', and aims to measure psychological well-being and physical health by analyzing written language on social media. For the results that Lyle Ungar presented at the TED presentations in Pennsylvania in 2015, a group of 66,000 Facebook users were asked to share their messages and posts with the research group, together with their age and gender. They were also asked to fill in the 'big five' personality test: a widely used questionnaire with which psychologists describe human personalities, returning a value for 'openness', 'conscientiousness', 'extraversion', 'agreeableness' and 'neuroticism'. Text mining is used here as a technique to derive information about Facebook users by connecting their word usage to their age, gender and personality profile.


=== 'mining' data implies direct non-mediated access to the source===
====mining metaphor====
The term 'data mining' is a fashionable buzzword used to speak about the practice of data analytics. Data mining techniques are nothing new; they have been around since the 1960s. (??? reference!) But the term became more fashionable with the increasing amounts of data that are published online and made accessible for data analytics in one way or another: a phenomenon that has been called 'big data'. (??? reference!) Applications vary from predicting an author's age to predicting how customers feel about a brand or product.


The term 'data mining' is, however, not very accurate; calling data-analysis software 'data mining software' is misleading. First, 'data' is not the object that a 'data mining' process is looking for: while processing data, a 'miner' rather looks for patterns that occur. Would 'pattern mining' be a more accurate term to use?
===Setting 3===
[[File:Cqrrelations_Guy-de-Pauw-CLiPS-Pattern-introduction_small.jpg|thumb|left]]


The term contains the metaphor of 'mining', hinting that the software extracts information directly from the data. It implies that there is no mediating layer between the pool of texts and the information that rolls out of the software. But even if it is not 'data' that is mined for, 'patterns' do not suddenly appear out of the big pool of data items either. We will later look into a text mining workflow, and at the steps that affect the outcomes. Using the 'mining' metaphor leads to:
Guy de Pauw is in the middle of his presentation when he calls text mining a technology of shallow understanding. It is a cold week in mid January 2015. The room is filled with 40 artists, researchers, designers, activists, students (among others), of which most are interested in, or working with, free software. A lot of people sit with laptops on their laps, trying to keep up with the speed and amount of information. Not many people in the audience are familiar with text mining techniques, and Guy's presentation is full of text mining jargon. Taking as many notes as possible seems to be the best strategy for the moment. Meanwhile, Guy formulates the fundamental problems that text mining is facing: how to transform text from form to meaning? How to deal with semantics and meaning? And how can a computer 'understand' natural language without any world knowledge? It is telling how much effort Guy makes to show the problematic points in text understanding practices. In one of his next slides, Guy shows an image where one sentence is interpreted in five different ways. Each version of the sentence 'pretty little girl's school' is illustrated to reveal the different meanings this short sentence contains. Guy transcribes briefly: “Version one: the pretty school for little girls. Version two: the seemingly little girl and her school. Version three: the beautiful little girl and her school. And so forth.”


* no human responsibility
[[File:From_CLiPS-presentations-during-Cqrrelations_jan-2015_Brussels-Pretty-little-girls-school.png|center]]
* outcomes regarded to be objective
* “it's the data that speaks”


A few minutes earlier, Guy showed an image of two wordclouds that represent the words, phrases and topics most highly distinguishing females and males. '<3', 'shopping' and 'excited' are labeled as most typically female. 'Fuck', 'wishes' and 'he' are presented as most typically 'male'. A little rush of indignation moved through the room. 'But, how?!'. You could see question marks rising above many heads. How is this graph constructed? Where does it come from? Guy explained how he is interested in gender detection in a different sense. In the graph, words were connected to themes and topics, whereupon it is only a small step to speak about 'meaning' and what females 'are'. Guy's next slide showed that he is more interested in looking at gender in a grammatical way: by analyzing the structures of sentences written by females and comparing these to male-written sentences. Then all there is to say is: women use more relational language and men more informative language.


====KDD steps (Knowledge Discovery in Data)====
Shallow understanding? Guy shows the website 'biograph.be' to illustrate his statement. It is a text mining project in which connections are drawn between hypotheses in academic papers. The project can be used for prevention, diagnosis or treatment purposes. 'Automated knowledge discovery' is promised to prevent anyone from 'drowning in information'. Guy adds some critical remarks: using this technology in medical contexts “will lead to a fragmentation of the field” as well as to “poor communication between subfields”.
As 'mining' is not a very accurate description of the technology that analyses text, it is helpful to look at a term used in the academic field: 'Knowledge Discovery in Data' (KDD) (Custers ed. 2013). Here, 'data mining' is only one of five steps in total, and can only be performed when the others are executed as well.


Guy is invited to speak and to introduce the group to a text mining software package. The software is called 'Pattern' and is developed at the University of Antwerp, where Guy is part of the CLiPS research group: 'Computational Linguistics & Psycholinguistics'. Coming from a linguistic background, the CLiPS research group approaches its projects from structural rather than statistical approaches. This nuance is difficult to grasp when only results are presented. Guy hits a button on his keyboard and his presentation jumps to the next slide: a short, bullet-pointed overview of linguistic approaches to text understanding for computers. From a knowledge representation approach in the 70s, where sentence structures were described in models that were fed into the computer. Via a knowledge-based approach in the 80s, where corpora were created to recognize sentence structures on a word level; word types such as 'noun', 'verb' or 'adjective' functioned as labels. Towards the period that started in the mid 90s: a statistical and shallow-understanding approach. Text understanding became scalable, efficient and robust, and making linguistic models became easier and cheaper. Guy immediately adds a critical remark: is this a phenomenon of scaling up by dumbing down?

step 1 --> data collection
step 2 --> data preparation
step 3 --> data mining
step 4 --> interpretation
step 5 --> determine actions
 
To this list, I would like to add a few extra points, and rename the steps:
 
* step 0 --> deciding on point of view
  step 1 --> text collection
  step 2 --> text preparation
  step 3 --> construction of data
* step 3a --> creating a vector space
              (turning words into numbers)
* step 3b --> creating a model
              (searching for contrast in the graph)
  step 4 --> interpretation
  step 5 --> determine actions
 
<small>step 1,2,3,4,5 from: Data Mining and Profiling in Large Databases, Bart Custers, Toon Calders, Bart Schermer, and Tal Zarsky (Eds.) (2013) + step 0, 3a, 3b from: #!PATTERN+ project </small>
 
Step 0 highlights the subjective process of deciding which datasets to work with. For example: the World Well Being Project asked 66,000 Facebook users to share their messages with them, in order to investigate whether the writing style on that platform could reveal something about the psychological profile the users seemed to belong to.
 
Later I will zoom in on 'grey' moments of text mining software. Most of those moments fall under step 3, the moment of data mining. Making a distinction between the moment that written text is transformed into numbers (3a) and the moment where the 'model' is created (3b) offers some clarity that will help to look at these actions individually.
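As a toy illustration of steps 3a and 3b, the following sketch (plain Python, not Pattern's implementation, with a made-up two-text 'corpus') turns each text into a vector of word counts, and reduces the 'model' to comparing a new text against the class vectors by dot product:

```python
from collections import Counter

# step 3a: creating a vector space (turning words into numbers)
def vectorize(text, vocabulary):
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# a made-up, minimal 'corpus': one example text per class
texts = {"positive": "great great movie", "negative": "terrible boring movie"}
vocabulary = sorted({w for t in texts.values() for w in t.split()})

# step 3b: creating a 'model' -- here simply the class vectors themselves
model = {label: vectorize(text, vocabulary) for label, text in texts.items()}

# step 4: interpretation -- classify a new text by dot-product similarity
def classify(text):
    v = vectorize(text, vocabulary)
    return max(model, key=lambda label: sum(a * b for a, b in zip(model[label], v)))

print(vocabulary)                                  # ['boring', 'great', 'movie', 'terrible']
print(classify("a great film, not boring at all"))  # positive
```

Even this toy version makes the 'grey' decisions visible: which texts form the corpus (step 0), which words enter the vocabulary, and how similarity is computed all shape the outcome before any 'mining' happens.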
 
 
=== text mining with a cultural side-product===
Data is rather derived from written blog posts, tweets, news articles, Wikipedia articles, Facebook messages and many other sources. When these media formats are created, a writer concentrates on formulating a sentence and getting a message across. It is John who tries to tell his Twitter followers that the latest BBC game show is much less interesting this season. John is not consciously creating data. It is a side product, a product of the current media culture that happens in public space.
 
 
==1.2 – Pattern's* gray spots==
 
(*) Pattern is a text mining software package that includes all the steps mentioned above as KDD. The software is written and developed at the University of Antwerp as part of CLiPS, a research center working in the field of Computational Linguistics & Psycholinguistics. It is a basic toolkit that includes the main 'mining' tools (such as text crawlers, text parsers, machine learning tools and visualization scripts).
 
 
questions
 
* What are the levels of construction in text mining software itself?
** What gray spots appear when text is processed?
*** What is meant by 'grayness'? How can it be used as an approach to software critique?
*** Text processing: how does written text transform into data?
*** Bag-of-words, or 'document'
*** 'count' → 'weight'
*** trial-and-error, modeling the line
** Testing process
*** how is an algorithm actually only 'right' according to that particular test-set, with its own particularities and exceptions?
*** loops to improve your outcomes
 
 
===idea of 'greyness'?===
...
 
(Fuller & Goffey, 2012)
 
 
===bag-of-words, or 'document'===
written text ←→ text mining ←→ information
The simple act of counting words in a document is the very first act of processing text into numbers. Could the text be called data from now on? Data as a format of written text that is countable and processable by computer software? The text is now 'ordered', at least in computational terms.
>>> document = Document("a black cat and a white cat", stopwords=True)
>>> print document.words
{u'a': 2, u'and': 1, u'white': 1, u'black': 1, u'cat': 2}
<small>example of bag-of-words tool, source: pattern-2.6/examples/05-vector/01-document.py</small>
 
For the computer, language is nothing more than a 'bag-of-words' (Murtaugh, 2016). All meaning of the sentences is dropped, and all connection between words is gone. What remains is a tally of the most commonly used words. Word order is discarded, and words are connected to numbers to make the text 'digestible' for a computer system.
The name 'bag-of-words' brings up the image of a huge bag containing piles, of different heights, of the same words. A top 10 of most common words could already give insight into which topics are present in a text.
Pattern wraps this technique in the 'document' module. This raises a confusing double use of the term 'document': while the counted text has been extracted from a document itself (a blog post, tweet or essay), here the bag-of-words set is called a document again, as if nothing actually happened and we were still looking at the actual source.
 
 
==='count' → 'weight'===
Soon after the first (brutal...) act of 'bagging' text into word counts, the journey towards 'meaningful text' starts again. To compare how similar two documents are (in terms of word use), 'weight' is introduced. It normalizes term frequency by computing the weight of each word in relation to the total number of words in that document.
 
Document.words stores a dict of (word, count)-items.
Document.vector stores a dict of (word, weight)-items,
where weight is the term frequency normalized (0.0-1.0)
to remove document length bias.
<small>description of a 'document' in Pattern's source code, source: pattern-2.6/pattern/vector/__init__.py</small>
 
>>> document = Document("a black cat and a white cat", stopwords=True)
>>> print document.vector.features
[u'a', u'and', u'white', u'black', u'cat']
>>> for feature, weight in document.vector.items():
...     print feature, weight
a 0.285714285714
and 0.142857142857
white 0.142857142857
black 0.142857142857
cat 0.285714285714
<small>example to show word-weights, source: pattern-2.6/examples/05-vector/01-document.py </small>
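These weights can be reproduced by hand: each count is divided by the total number of words in the document (here seven), so 'cat' gets 2/7 ≈ 0.2857. A minimal sketch in plain Python:

```python
from collections import Counter

words = "a black cat and a white cat".split()
counts = Counter(words)
total = len(words)  # 7 words in total

# term frequency normalized by document length, as in Pattern's vector
weights = {word: count / total for word, count in counts.items()}

print(weights["cat"])  # 0.2857142857142857
print(sum(weights.values()))  # close to 1.0: the weights always sum to one
```

Because every document's weights sum to one, a long review and a short tweet become comparable: the 'document length bias' the source code mentions is divided away.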
 
 
===trial-and-error, modeling the line===
[[File:Knowledge-discovery-in-data pattern-mining-types.png|border]]
 
(description of how a mining process tries many pattern recognition algorithms, to find the most 'effective')
 
 
''Once the line has been drawn, you can throw all the data-points away,''
''because you have a model: this is the moment of truth construction.''
<small>source: Data Mining and Profiling in Large Databases, Bart Custers, Toon Calders, Bart Schermer, and Tal Zarsky (Eds.) (2013)</small>
 
 
===testing data===
how is an algorithm actually only 'right' according to that particular test-set, with its own particularities and exceptions?
 
Option to look at testing techniques, like 'golden standard', 80%/20%, and more....
 
# The only way to really know if you're classifier is working correctly
# is to test it with testing data, see the documentation for Classifier.test().
<small>comment written in Pattern's KNN example, source: pattern-2.6/examples/05-vector/04-knn.py</small>
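The 80%/20% idea can be sketched with a made-up toy dataset (an illustration of the principle only, independent of Pattern's actual Classifier.test()): train on one part of the labelled data, hold out the rest, and measure accuracy only on the held-out part.

```python
import random

# a hypothetical toy dataset of labelled examples: (text, label)
data = [("great film", "pos"), ("awful film", "neg")] * 50
random.seed(0)
random.shuffle(data)

# the 80%/20% split: train on the first part, test on the held-out rest
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

# a trivial 'classifier': keep the words that occur in positive
# training examples but never in negative ones
pos_words = {w for text, label in train if label == "pos" for w in text.split()}
neg_words = {w for text, label in train if label == "neg" for w in text.split()}
discriminative = pos_words - neg_words

def predict(text):
    return "pos" if set(text.split()) & discriminative else "neg"

# accuracy is only 'right' relative to this particular test set
accuracy = sum(predict(text) == label for text, label in test) / len(test)
print(len(train), len(test), accuracy)  # 80 20 1.0
```

The perfect score here is exactly the point of the question above: the classifier is 'right' for this test set, with its own particularities and exceptions, and says nothing about texts that look different.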
 
 
===loops to improve your outcomes===
 
'''the threshold of positivity can be lowered or raised'''
 
# The positive() function returns True if the string's polarity >= threshold.
# The threshold can be lowered or raised, but overall for strings with multiple
# words +0.1 yields the best results.
>>> print "good:", positive("good", threshold=0.1)
>>> print " bad:", positive("bad")
<small>comment written in Pattern's sentiment example, source: pattern-2.6/examples/03-en/07-sentiment.py</small>
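How such a threshold works can be illustrated with a hypothetical lexicon (the words and polarity values below are made up for this sketch; Pattern's real values live in its annotated sentiment wordlists):

```python
# hypothetical polarity scores, standing in for Pattern's annotated lexicon
polarity = {"good": 0.7, "bad": -0.7, "okay": 0.05}

def positive(word, threshold=0.1):
    # True if the word's polarity >= threshold, as in Pattern's positive()
    return polarity.get(word, 0.0) >= threshold

print(positive("good"))                 # True
print(positive("okay"))                 # False: 0.05 < 0.1
print(positive("okay", threshold=0.0))  # True: lowering the threshold flips it
```

The last two lines show the 'loop to improve your outcomes': nothing about the text changes, only where the line between positive and not-positive is drawn.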
 
 
'''if you get a 0.0 value for “happy” something is wrong'''
 
[[File:If-happy-is-0.0-something-is-wrong.png|border]]
 
<small>answer by Tom de Smedt on CliPS's Google Groups about the sentiment_score() function, source: https://groups.google.com/forum/#!topic/pattern-for-python/FTeqb0p5eFM </small>
 
 
 
<div style="color:gray;">
other 'flags':
 
'''averaging polarity'''
 
<word form="amazing" wordnet_id="a-01282510" pos="JJ" sense="inspiring awe or admiration or wonder" polarity="0.8" subjectivity="1.0" intensity="1.0" confidence="0.9" />
 
<word form="amazing" wordnet_id="a-02359789" pos="JJ" sense="surprising greatly" polarity="0.4" subjectivity="0.8" intensity="1.0" confidence="0.9" />
 
<small>items of an annotated adjectives wordlist, source: pattern-2.6/pattern/text/en/en.sentiment.xml</small>
 
 
>>> print word, sentiment("amazing")
 
amazing (0.6000000000000001, 0.9)
 
<small>example script, source: pattern-2.6/examples/03-en/07-sentiment.py</small>
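The value (0.6000000000000001, 0.9) can be reconstructed by averaging the two annotated senses of 'amazing' above, assuming sentiment() averages sense scores (which the matching numbers suggest); the odd trailing digit is ordinary floating-point rounding:

```python
# the two annotated senses of 'amazing' from en-sentiment.xml
senses = [
    {"polarity": 0.8, "subjectivity": 1.0},  # 'inspiring awe or admiration'
    {"polarity": 0.4, "subjectivity": 0.8},  # 'surprising greatly'
]

polarity = sum(s["polarity"] for s in senses) / len(senses)
subjectivity = sum(s["subjectivity"] for s in senses) / len(senses)

print(polarity, subjectivity)  # 0.6000000000000001 0.9
```

Averaging quietly erases the difference between 'inspiring awe' (0.8) and merely 'surprising greatly' (0.4): one number now stands for two distinct senses of the word.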
 
 
'''annotating subjectivity'''
 
<word form="haha" wordnet_id="" pos="UH" polarity="0.2" subjectivity="0.3" intensity="1.0" confidence="0.9" />
 
<word form="hahaha" wordnet_id="" pos="UH" polarity="0.2" subjectivity="0.4" intensity="1.0" confidence="0.9" />
 
<word form="hahahaha" wordnet_id="" pos="UH" polarity="0.2" subjectivity="0.5" intensity="1.0" confidence="0.9" />
 
<word form="hahahahaha" wordnet_id="" pos="UH" polarity="0.2" subjectivity="0.6" intensity="1.0" confidence="0.9" />
 
<small>items of an annotated adjectives wordlist, source: pattern-2.6/pattern/text/en/en.sentiment.xml</small>
</div>
 
==1.3 – text mining applications==
 
=== applications of text mining ===
[[User:Manetta/i-could-have-written-that/kdd-applications |of Pattern, Weka, and the World Well Being Project &rarr; listed here]]
 
Showing that text mining has been applied across very different fields, thereby seeming to be a sort of 'holy grail', solving a lot of problems.
 
(i'm not sure if this is needed)





Revision as of 17:22, 3 March 2016

i could have written that - chapter 1

on what basis? three settings to highlight differences in text analytical ideologies

Setting 1

EUR-PhD-defence-sentiment-mining.JPG

It is a wet Friday afternoon in mid November 2015. A woman dressed in traditional academic garment enters the lecture room. The people in the audience stand up from their seats. The woman carries a stick with bells. They tinkle softly. It seems to be her way to tell everyone to be silent and stay focused on that what is coming. A group of two woman and seven man dressed in toga's follow her. They walk to their seats behind the jury table. The doctoral candidate of the Economy department starts his defense. He introduces his area of research with a short description of the increasing amount of information that is published on the internet these days in the form of text. Text that could be used to extract information about the reputation of a company. It is important for decision makers to know how the public feels about their products and services. Online written material such as reviews, are a very useful source for companies to extract that information from. How could this be done? The candidate illustrates his methododology with an image of a branch with multiple leaves. When looking at the leaves, one could order them by color, or shape. Such ordering techniques can be applied to written language as well: by analyzing and sorting words. The candidate's topic of research has been 'sentiment analysis in written texts'. This nothing new. Sentiment analysis is a common tool in the text mining field. The candidate's aim is to improve this technique. He proposes to detect emoticons as sentiment, and to add more weight to more important segments of a sentence.

One of the professors opens the discussion on a critical tone. He asks the candidate to his definition of the word 'sentiment'. The candidate replies by saying that sentiment is what people intend to convey. There is the sentiment that the reader perceives, and there is sentiment that the writer conveys. In the case of reviews, sentiment is a judgment. The professor states that the candidate only used the values '-1' and '+1' to describe sentiment in his thesis, which is not a definition. The professor continues by asking if the candidate could offer a theory where the thesis has been based on. But there is again no answer that fulfills the professor's request. The professor claims that the candidate's thesis only presents numbers, no definitions.

Another professor continues and asks for the 'neutral backbone' that is used in the research to validate the sentiment of certain words. Did the candidate collaborate with psychologists for example? The candidate replies that he collaborated with a company that manually rated the sentiment values of words and sentences. He cannot give a description about how that annotation process has been executed. The professor highlights the importance of an external backbone that is needed in order to be able to give results. Which brings him to his next question. The professor counted 6000 calculations that had been done to confirm the candidate's hypothesis. This 'hypothesis testing' phenomenon is a recurring element in the arguments of the thesis. The candidate is asked if he wasn't over-enthusiastic in his results.

But the jury must also admit that it is quite an achievement to write a doctoral thesis on the topic of text mining at a university where there is no department of linguistics, and neither in computer science. The candidate was located at the Economics department under the 'Erasmus Research Institute of Management'. Though, when the candidate was asked about his plans to fix the gaps in his thesis, he replied with saying that he already had a job in the business, and rewriting his thesis would not be a priority nor his primary interest.


Setting 2

2:11 Now let's start with something that's relatively clear,
2:14 and let's see if it makes sense. 
2:15 See the words that are most typical, most discriminative, most predictive,
2:21 of being female.
2:23 (Laughter)
2:30 Yeah, it's a little bit embarrassing, I'm sorry, but I didn't make this up!
2:34 It's very cliché, but these are the words.
TED-talk-screenshot Lyle-Unger-World-Well-Being-Project predicting-heartdiseases-using-Twitter.png

The video reaches the 2:11 minutes when Lyle Ungar starts his introduction to the first text mining results that he will present his TED audience tonight. Luckily enough he can start with something that is relatively clear: the words that are most typical, most discriminative, most predictive, of being female. Nothing too complicated to start with. He proposes to team up with his audience, to see together if the outcomes make sense. Only 10 seconds later Lyle hits the button on the TED slide remote controller. In the reflection of his glasses appears a bright semblance. The audience slowly starts to titter. Lyle looks up to face his audience. He frowns, turns his head, walks a few steps to the right and sights theatrical and somewhat too loud. While Lyle's posture speaks the language of shameful soreness, a white slide with colorful words appears on the screen. '<3' is typeset in the largest font size, followed by 'excited', 'shopping', 'love you' and 'my hair', surrounded by another +-50 words that together form the shape of a cloud. '(Laughter)', appears in the subtitles. The audience seems to recognize the words, and responds to them with a stiffled laughter. Is it the term 'shopping' that appears so big that is funny? Because it confirms a stereotype? Or is it surprising to see what extreme expressions appear to be typical for being female? Lyle had seen it coming, and quickly excuses himself for the results by saying: I didn't make this up! It's very cliché, but these are the words.

These results are part of a text mining research project of the University of Pennsylvania called the 'World Well Being Project' (WWBP). The project is located at the 'Positive Psychology Center', and aims to measure psychological well-being and physical health by analyzing written language on social media. For the results that Lyle Ungar presented at the TED presentations in Pennsylvania 2015, a group of 66.000 Facebook users were asked to share their messages and posts with the research group, together with their age and gender. They were also asked to fill in the 'big five personality test'. A widely used questionnaire that is used by psychologists to describe human personalities and returns a value for 'openness', 'conscientiousness', 'extraversion', 'agreeableness', and 'neuroticism'. Text mining here is used as a technique to derive information about Facebook users by connecting their word usage to their age, gender and personality profile.


===Setting 3===

[[File:Cqrrelations Guy-de-Pauw-CLiPS-Pattern-introduction small.jpg|thumb|left]]

Guy de Pauw is in the middle of his presentation when he calls text mining a technology of shallow understanding. It is a cold week in mid January 2015. The room is filled with 40 artists, researchers, designers, activists and students (among others), most of whom are interested in, or working with, free software. A lot of people sit with laptops on their laps, trying to keep up with the speed and amount of information. Not many people in the audience are familiar with text mining techniques, and Guy's presentation is full of text mining jargon. Making as many notes as possible seems to be the best strategy for the moment. Meanwhile, Guy formulates the fundamental problems that text mining is facing: how to transform text from form to meaning? How to deal with semantics and meaning? And how can a computer 'understand' natural language without any world knowledge? It is telling how much effort Guy puts into showing the problematic points in text understanding practices. In one of his next slides, Guy shows an image in which one sentence is interpreted in five different ways. Each version of the sentence 'pretty little girl's school' is illustrated to reveal the different meanings that this short sentence contains. Guy briefly transcribes: “Version one: the pretty school for little girls. Version two: the seemingly little girl and her school. Version three: the beautiful little girl and her school. And so forth.”

[[File:CLiPS-presentations-during-Cqrrelations jan-2015 Brussels-Pretty-little-girls-school.png|thumb|left]]
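One way to see where the ambiguity of 'pretty little girl's school' comes from: a four-word phrase already admits exactly five binary bracketings (the Catalan number C3 = 5), each grouping the words differently before any meaning is assigned. A small illustrative sketch, not taken from Guy's slides, that enumerates them:

```python
# Enumerate all binary bracketings (parse shapes) of a word sequence.
def bracketings(words):
    if len(words) == 1:
        return [words[0]]
    results = []
    for i in range(1, len(words)):          # every split point
        for left in bracketings(words[:i]):
            for right in bracketings(words[i:]):
                results.append(f"({left} {right})")
    return results
```

For 'pretty little girls school' this yields five shapes, among them the left-branching '(((pretty little) girls) school)' and the reading that attaches 'pretty' to the whole of '(little (girls school))'.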

A few minutes earlier, Guy had shown an image of two wordclouds representing the words, phrases and topics that most highly distinguish females from males. '<3', 'shopping' and 'excited' are labeled as most typically 'female'. 'Fuck', 'wishes' and 'he' are presented as most typically 'male'. A little rush of indignation moved through the room. 'But, how?!' You could see question marks rising above many heads. How is this graph constructed? Where does it come from? Guy explained how he is interested in gender detection in a different sense. In the graph, words are connected to themes and topics, whereupon it is only a small step to speak about 'meaning' and what females 'are'. Guy's next slide showed how he is more interested in looking at gender in a grammatical way: by analyzing the structures of sentences written by females and comparing them to male-written sentences. Then, all there is to say is: women use more relational language and men more informative language.
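A minimal sketch of one common way such 'most distinguishing words' lists are computed (the toy words below are invented for illustration, and this is not necessarily the method behind the graph on Guy's slide): rank each word by its smoothed log-odds of occurring in one group's texts versus the other's.

```python
# Toy sketch: score words by smoothed log-odds between two groups of tokens.
from collections import Counter
from math import log

def log_odds(group_a, group_b):
    """group_a, group_b: flat token lists. Returns a score per word:
    > 0 means more typical of group A, < 0 more typical of group B."""
    ca, cb = Counter(group_a), Counter(group_b)
    vocab = set(ca) | set(cb)
    v = len(vocab)
    scores = {}
    for w in vocab:
        pa = (ca[w] + 1) / (len(group_a) + v)   # add-one smoothing
        pb = (cb[w] + 1) / (len(group_b) + v)
        scores[w] = log(pa / pb)
    return scores
```

Words near zero are used equally by both groups; a wordcloud then typesets the highest- and lowest-scoring words largest, which is exactly how a stereotype-confirming term can come to dominate the image.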

Shallow understanding? Guy shows the website 'biograph.be' to illustrate his statement. It is a text mining project in which connections are drawn between hypotheses from academic papers. The project can be used for prevention, diagnosis or treatment purposes. 'Automated knowledge discovery' is promised to prevent anyone from 'drowning in information'. Guy adds some critical remarks: using this technology in medical contexts “will lead to a fragmentation of the field” as well as to “poor communication between subfields”.

Guy is invited to speak and to introduce the group to a text mining software package. The software is called 'Pattern' and is developed at the University of Antwerp, where Guy is part of the CLiPS research group: 'Computational Linguistics & Psycholinguistics'. Coming from a linguistic background, the CLiPS research group approaches its projects from a structural rather than a statistical angle. This nuance is difficult to grasp when only results are presented. Guy hits a button on his keyboard and his presentation jumps to the next slide: a short bullet-pointed overview of linguistic approaches to text understanding for computers. It starts with the knowledge representation approach of the 70s, in which sentence structures were described in models that were fed into the computer. It continues via the knowledge-based approach of the 80s, in which corpora were created to recognize sentence structures on a word level; word types such as 'noun', 'verb' or 'adjective' functioned as labels. It ends with the period that started in the mid 90s: a statistical and shallow understanding approach. Text understanding became scalable, efficient and robust, and making linguistic models became easier and cheaper. Guy immediately adds a critical remark: is this a phenomenon of scaling up by dumbing down?
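The word-level labeling that characterizes the 80s knowledge-based approach can be sketched as a simple lexicon lookup. The lexicon below is invented for illustration and is not Pattern's (Pattern's taggers are far richer); the point is only to show what 'word types functioning as labels' looks like in practice, and how quickly a hand-made lexicon runs out of words.

```python
# Toy knowledge-based tagger: each token is labeled by lexicon lookup.
# The lexicon is hand-made and hypothetical, for illustration only.
LEXICON = {
    "pretty": "adjective",
    "little": "adjective",
    "girls": "noun",
    "school": "noun",
}

def tag(sentence):
    """Label every token; anything outside the lexicon stays 'unknown'."""
    return [(word, LEXICON.get(word, "unknown")) for word in sentence.split()]
```

The statistical approach of the mid 90s replaces this brittle lookup with probabilities learned from corpora, which is what made text understanding scalable; whether that trade is 'scaling up by dumbing down' is exactly Guy's question.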


==links==

* thesis in progress (overview)
* intro &+
* chapter 1