parsing

text is unstructured, amorphous, and difficult to deal with. (...) The motivation for trying to extract information from it is compelling—even if success is only partial. (Witten 2011, p.386)

In other words, by "unstructured" it is meant: unstructured in relation to the machine -- that is, not explicitly structured in a format directly amenable to use by automated means. (Murtaugh 2016, A_bag_but_is_language_nothing_of_words)
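A minimal sketch of what "amenable to use by automated means" can look like in practice, assuming Python and only its standard library. The helper name `bag_of_words` and the sample sentence handling are hypothetical illustrations, not taken from Witten or Murtaugh. The sentence is reduced to a table of word counts, and everything a human reader relies on (word order, punctuation, layout) is discarded along the way.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its frequency.

    Hypothetical helper for illustration only.
    """
    # Lowercase the text and keep runs of letters and apostrophes;
    # punctuation, spacing, and word order are thrown away.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


sentence = "Text is unstructured, amorphous, and difficult to deal with."
bag = bag_of_words(sentence)

# The "structured" result: an unordered mapping from word to count.
for word, count in sorted(bag.items()):
    print(word, count)
```

The result is structured for the machine in exactly the narrow sense of the quote: a fixed format that a program can count, sort, and compare, at the cost of the structure the writing already had.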

Weka 3

Pattern

notes

The computer scientist's view of textual content as "unstructured", be it in a webpage or the pages of a scanned text, underscores / reflects the negligence to the processes and labor of writing, editing, design, layout, typesetting, and eventually publishing, collecting and cataloging. (Murtaugh 2016, A_bag_but_is_language_nothing_of_words)

The superficial similarity between text and data mining conceals real differences. In the Preface (page xxi), we characterized data mining as the extraction of implicit, previously unknown, and potentially useful information from data. With text mining, however, the information to be extracted is clearly and explicitly stated in the text. It is not hidden at all—most authors go to great pains to make sure that they express themselves clearly and unambiguously. From a human point of view, the only sense in which it is “previously unknown” is that time restrictions make it infeasible for people to read the text themselves. (Witten 2011, p. 386)

User:Manetta/i-could-have-written-that/text-processing/simplification
