
chapter 3 – training truth (from 'mining' to Knowledge Discovery in Databases (KDD))

'mining' as metaphor

The metaphor of text- or data 'mining' is often used to refer to results that are in fact created by means of statistical computation. What side-effects occur when this metaphor is used?

[examples from (popular) media here]

Knowledge Discovery in Databases steps (1989)

The metaphor of 'to mine' or 'mining' is often used these days, by corporations [REF?] but also in the academic field [REF?], to point to a process of gathering information from a large set of data. The term became especially popular among businesses and corporations (Piatetsky-Shapiro & Parker, 2011). But there have already been attempts to use a different term, starting at a workshop in Detroit in 1989. During this event the term Knowledge Discovery in Databases (KDD) was coined by Gregory Piatetsky-Shapiro. This term describes, in a number of steps, the full process from selecting input data to interpreting the results.

The initial KDD model was formulated by Fayyad, Piatetsky-Shapiro & Smyth (1996). It describes not only the data-related steps in the process:

step 1 → identifying the goal
step 2 → creating a target data set
step 3 → cleaning & preprocessing
step 4 → data reduction & projection
step 5 → matching the goals (step 1) to a particular data-mining method
step 6 → exploratory analysis and model selection
step 7 → data mining
step 8 → interpreting mined patterns
step 9 → acting on the discovered knowledge 

note: “The KDD process can involve significant iteration and can contain loops between any two steps” (Fayyad, Piatetsky-Shapiro & Smyth, 1996). (see image below)
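A hypothetical sketch of that iterative character in code; neither the function names nor the data come from the 1996 paper, and the 'pattern' that is searched for is deliberately trivial:

# a hypothetical sketch of the loop in the KDD steps, not code from the paper
def mine(data, threshold):
    # stand-in for step 7: a 'pattern' is a value occurring at least threshold times
    return [value for value in set(data) if data.count(value) >= threshold]

data = ["rain", "sun", "rain", "rain", "wind", "sun"]
threshold = 4

patterns = mine(data, threshold)     # step 7: data mining
while not patterns:                  # step 8: interpretation judges the result not useful
    threshold -= 1                   # loop back: revise a choice made in an earlier step
    patterns = mine(data, threshold)
print(patterns)                      # step 9 would act on this 'knowledge'

The while loop is the point of the sketch: as long as the interpretation of the result is 'not useful', the analyst steps back and revises an earlier choice.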

Piatesky-Shapiro 1996 KDD-steps.png
initial KDD steps as described by Fayyad, Piatetsky-Shapiro & Smyth (1996)


This model of the KDD steps starts with identifying the goal that is aimed for, which immediately puts the focus on the presence of a data analyst. The model was published in 1996 and written with the aim of producing a more nuanced framework for techniques that, already then, were overwhelmed by a wave of excitement in the 90s. It is telling that the model by Calders & Custers, published in 2013, focuses much less on the analyst and immediately on the data. In a time when 'data science' has been an official department title at universities since 2011, data is placed at the centre. The act of collecting data is presented as the very first step of the process.

Although 'data' is crucial material for the whole process, 'data mining' is only one of the multiple steps in every version of the KDD steps. The 'data mining' step is the moment in the process where the analyst searches for patterns in the data, using one of the many different algorithms that are available. Although this step is crucial, it is not possible to execute pattern recognition without collecting data and preparing it to be useful. By using the term 'data mining' to refer to the whole process of creating information out of data, many human decisions and subjective choices are placed in the shadow of the algorithmic calculations on the data. The term 'data mining' then functions as an objective curtain for a more tumultuous process.

The moment of 'data mining', when patterns are sought in the data, is the most complex KDD step, and therefore difficult to fully understand. How these algorithms actually detect patterns in the data remains rather vague and unclear, even for many analysts. Different algorithms are applied to the dataset and their outcomes are compared (Hans Lammerant, how to ref a workshop moment?). Because the actual behaviour of the algorithms is complex, a trial-and-error method is needed to come to results, something the makers of the 1996 model already described. In this way, the KDD steps become a system in which an analyst can always take a few steps back to increase the quality of his or her results.
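As a minimal sketch of that trial-and-error comparison, assuming scikit-learn (the built-in iris dataset and the three algorithms are illustrative choices):

# comparing different mining algorithms on the same dataset;
# a sketch assuming scikit-learn, with illustrative choices throughout
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

for algorithm in (DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()):
    score = cross_val_score(algorithm, X, y, cv=5).mean()
    print(type(algorithm).__name__, round(score, 3))

Which score counts as 'good enough' is again a decision of the analyst, not of the software.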


IBM-Data-mining circulation-workflow.gif
IBM's version of the iterative nature of a data-mining process


Text-mining-technical-process.png
diagram to sketch the position of expectations & a point of departure


By using the term 'data mining' to refer to a full information-creating process, it becomes almost impossible for a wider public to discuss the outcomes, or even to disagree with them. As the exact behaviour of an algorithm is difficult to grasp, the 'data mining' step is a complex subject for a conversation. Both analysts and a broader public interested to know more would have difficulty describing what calculations are made by the software, and why.

If it is important to speak about the construction of results, the KDD steps should be the framework for analysts to document their findings. The initial model from 1996 includes not only data-treating steps, but also such subjective acts as 'identifying' and 'exploratory analysis'. This takes the moments of subjective human intervention out of the shadow of complex algorithmic calculations. It could bring decisions and human subjectivity back into conversations about these systems: systems that create information, useful information, or, as it is also called, 'knowledge'.

KDD steps (2013)

Different variations of these KDD steps are circulating. One of them is described by Calders & Custers (2013, p.7):

step 1 → data collection
step 2 → data preparation
step 3 → data mining
step 4 → interpretation
step 5 → determine actions

Calders-and-Custers 2013 KDD-steps.png

The model of Calders & Custers is named Knowledge Discovery in Data steps, a small variation on the discovery of knowledge in Databases. The other thing that stands out is that they use rather objective descriptions of the steps. Their KDD steps start with the collection of data. They thereby bypass the moment where the analyst sets the intentions that lead to the goal of the KDD process (Fayyad, Piatetsky-Shapiro & Smyth, 1996). Step 2 is called 'data preparation' and is described as the act of 'rearranging and ordering' the data (Calders & Custers, 2013, p.8), which makes it sound like a step that does not touch the data itself, but merely its order and categories. They thereby leave aside that the act of preprocessing can also be referred to as the moment where the data is 'cleaned up'. Such cleaning operations include “removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes” (Fayyad, Piatetsky-Shapiro & Smyth, 1996, p.42).
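A minimal sketch of what such cleaning operations can look like in practice, assuming the pandas library (the columns and values are invented for illustration):

# a hypothetical sketch of cleaning decisions during 'data preparation';
# assumes pandas, and the columns and values are invented
import pandas as pd

df = pd.DataFrame({
    "age":   [34, None, 29, 120, 41],    # one missing field, one suspicious value
    "score": [0.7, 0.4, None, 0.9, 0.6],
})

df["age"] = df["age"].fillna(df["age"].median())  # decide: a missing age becomes the median
df = df[df["age"].between(0, 110)]                # decide: age 120 counts as noise, drop the row
df["score"] = df["score"].fillna(0.0)             # decide: a missing score becomes 0.0
print(df)

Each of these lines is a human decision about what counts as noise and what a missing value should become; none of them is a neutral 'rearranging and ordering'.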

Only step 4, where the 'interpretation' of the found patterns takes place, points to a moment of doubt, uncertainty, or at least human subjectivity: the analyst needs to interpret the output of the algorithm. This is the moment where results are rated as either useful or wrong. But useful or wrong according to what?

conclusion

As opposed to 'data mining', the term Knowledge Discovery in Databases makes it possible to differentiate better between computer processes and moments of human decision.

to add to this chapter: examples from a text mining software package

--> add about the tweaking aspect of getting the script working, where exhaustion and frustrations also play a part. --> add elements of modality.py + the tweaking element of the uncertainty threshold.


threshold tweaking

# excerpt from Pattern's modality.py; modality() is defined earlier in that module
def uncertain(sentence, threshold=0.5):
    return modality(sentence) <= threshold

The modality() function is followed by a super short function that decides where exactly the border between certainty and uncertainty lies. The function uncertain() can be used to quickly return a 'yes' or 'no'. The border between certainty and uncertainty is set here at exactly 0.5, a number that can also be translated into the words 'a half'. Is this a coincidence? Besides that, the number 0.5 is relatively high within the full range of -1.0 to +1.0. It means that, according to the uncertain() function, there is a higher chance that a sentence is called uncertain, and only a small share of sentences could be called certain.
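A sketch of how this threshold could be tweaked in practice, assuming Pattern's pattern.en module is installed (the example sentences are made up):

# tweaking the uncertainty threshold of Pattern's modality();
# the example sentences are invented
from pattern.en import parse, Sentence, modality

def uncertain(sentence, threshold=0.5):
    s = Sentence(parse(sentence, lemmata=True))
    return modality(s) <= threshold

for sentence in ("It is certain that it works.", "I guess it might work."):
    for threshold in (0.5, 0.25, 0.0):
        label = "uncertain" if uncertain(sentence, threshold) else "certain"
        print(sentence, "| threshold", threshold, "->", label)

Moving the threshold from 0.5 down to 0.0 makes the function much more reluctant to call a sentence uncertain; every value draws the border somewhere else.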

Certain-uncertain threshold.png

+ positive/negative threshold

+ if 'happy' is 0.0, something is wrong

If-happy-is-0.0-something-is-wrong.png
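Both notes above can be read as sanity checks on Pattern's sentiment functions; a minimal sketch, assuming pattern.en (the word 'happy' is the test case from the note):

# a minimal sanity check on Pattern's sentiment() and positive();
# polarity runs from -1.0 to +1.0, subjectivity from 0.0 to 1.0
from pattern.en import sentiment, positive

polarity, subjectivity = sentiment("happy")
print("'happy' ->", polarity)               # if this prints 0.0, something is wrong
print(positive("happy", threshold=0.1))     # True if the polarity lies above the threshold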


tweaking & quality expectations?

From-paper Information Extraction-Automatic.png

source?


semantic averaging

Pattern averaged amazing.png


annotation workflow

CLiPS-help-us-out-annotation.png

try it here


categories

intuitive mapping, from: Alexander Hoogenboom's PhD thesis

Alexander-hoogenboom intuitive-mapping-figure.png
