User:Manetta/i-could-have-written-that/data-mining-in-the-wild
'knowledge discovery in data' in the wild
problem formulations
hypothesis
The results of data-mining software are not mined; they are constructed.
wider public: the user
- terminology: because of the term data 'mining', data is seen as a material that can easily be extracted from the web. data is framed as 'raw' and regarded as a natural resource. but on closer inspection, data mining results seem to be constructed rather than found.
- free labour; users are microworkers who provide free labour to the services they use. This has come to light since Edward Snowden's disclosures, and people are slowly becoming more aware of the presence of data analytics.
- user-data is sold to third parties (who are these parties? and how is the data offered? in what shape? pre-selected?)
- user-profiles & user-predictions & user-recommendations; data mining is a technology fashion. because of the amount of user-data, data mining is an interesting tool for customized advertisements, search results & sales recommendations. how such results and recommendations are constructed is unclear, which makes it very difficult to disagree with or criticize them. how can we speak back to the construction of these results?
- both hardware and software are more and more 'black-boxed': checking how something works is difficult or made impossible. as most software today runs as a service, the user relies on the information given by the software company behind the service.
academic, enterprises, technological, specific
- results are easily accepted as objective truths that don't involve human-made decisions. but (data mining) algorithms aren’t just technical artifacts, they’re fundamentally human in their design and their use.*
- language is a cultural product, and so is computer language**; it's 'connotative' (ambiguous) → complexes of symbols, providing space for connotation***
- text-processing
- text mining software aims to be a universal text processing system.
- written text can only be processed when it is strongly simplified (into e.g. n-grams, bag-of-words or vector-space models).
- to search for meaningful information in meaningless data, reference datasets such as WordNet function as a norm for extracting semantic relations from data.
- in the 80s (when employed by IBM), Frederick Jelinek stated: "Every time I fire a linguist, the performance of the speech recognizer goes up." (it's labeled as a 'famous quote' on Wikipedia). the statement gives a certain autonomy to the data. But how does the researcher know when he is 'right'?
* (quoted from: Critical Algorithm Studies: a Reading List)
** (from: Florian Cramer, Language (2008))
*** (quoted from: Vilem Flusser, Towards a Philosophy of Photography (1983))
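The simplification of written text mentioned in the text-processing notes above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the method of any particular text mining package: the function names (`tokenize`, `ngrams`, `bag_of_words`, `cosine_similarity`) and the two example sentences are hypothetical, and only the Python standard library is used. The sketch reduces two sentences to n-grams and to a bag-of-words, then compares the two bags as vectors.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters (an assumed, very crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Count how often each token occurs; word order is thrown away."""
    return Counter(tokens)


def cosine_similarity(bag_a, bag_b):
    """Compare two bags as vectors in a space with one dimension per word."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[word] * bag_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two hypothetical sentences with opposite meanings.
sentence_a = "The dog bites the man."
sentence_b = "The man bites the dog."

tokens_a = tokenize(sentence_a)
tokens_b = tokenize(sentence_b)
bag_a = bag_of_words(tokens_a)
bag_b = bag_of_words(tokens_b)

print(ngrams(tokens_a, 2))               # bigrams still carry some local word order
print(bag_a == bag_b)                    # the two bags are identical
print(cosine_similarity(bag_a, bag_b))   # approximately 1.0: 'the same' text to the model
```

In this sketch the two sentences become indistinguishable once they are turned into bags of words, although a reader would not confuse them. Which simplification is chosen (tokenizer, n-gram length, what counts as a word) is a human decision that shapes the result.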
notes
algorithmic agreeability
overlap between computer scientists, statisticians & data scientists: data mining companies take over the role of (academic?) statisticians, with their own agenda (driven by profit and efficiency? ..... speculations here).