'''CLiPS'''

Third Day

'''Computational Linguistics & Psycholinguistics'''

Critique on the annotators of the sentiment analysis / pedophilia classifier:
- subjectivity is at the root of things, however it seems to be taken as matter-of-fact
- marketing value
- making things visible
- there is no date of the assessment, and no way of finding out more details about the annotation process
- not a process of annotation, but a process of evaluation
Links:
- http://cs229.stanford.edu/proj2013/ReesmanMcCann-Vehicle%20Detection.pdf
- http://cs.stanford.edu/people/karpathy/deepimagesent/
- http://cs.stanford.edu/people/karpathy/deepimagesent/devisagen_arxiv.pdf
- http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
- http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_JiXYY10.pdf
- feedback loops: analysis while typing
- Otlet's text: progressive at the time it came out; colonialist and racist (??, writing from memory) by today's standards
- 1984: a book being referenced a lot without previous knowledge of it; how is it influencing the current discourse?

'''Guy de Pauw, Walter Daelemans, Tom De Smedt'''

Links:
- http://eipcp.net/transversal/0106/holmes/en
- http://rybn.org/
- http://www.antidatamining.net/
- http://bureaudetudes.org/
- http://theyrule.net/
- http://theyrule.net/drupal/topics/visualization
- http://www.nanex.net/
- https://en.wikipedia.org/wiki/Louis_Bachelier
- http://littlesis.org/
- http://www.wdgann.com/about-us (W. D. Gann)
- http://www.wallstreetandtech.com/trading-technology/after-the-hash-crash-worrying-about-the-next-glitch/a/d-id/1268082
- https://en.wikipedia.org/wiki/Web_Bot
- http://predict-market.biz/
- http://asuperstitiousfund.com/
- http://robinhoodcoop.org/
- http://www.corp-lab.com/tradewar/

- Chomsky's work on linguistics: the language acquisition device
- extracting knowledge from data

'''Natural language processing'''

(words for the annotator: mild vs. inflammatory, oriented vs. disoriented)

- most information is in unstructured data (text)
- data in digital form
- big data (too big to handle with conventional means)
- >90% of currently available data was created in the last 2 years
- problems: accuracy levels, speed, fundamental problems (form-meaning relation, semantics, world knowledge)

Three kinds of knowledge extracted from text:
1. objective knowledge (machine reading): recognising which word types are used
2. subjective knowledge: sentiment, opinion, emotion, modality, (un)certainty; especially relevant since the advent of social media
3. meta knowledge: authorship, author attributes (educational level, age, gender, personality, region, illness), text attributes (date of writing, ...); style markers such as "of the", "on the", "has been" (see the sketch below)
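Not from the session notes: a toy sketch of how such function-word style markers could be counted as relative frequencies (the input file is one of the speeches used in the code further below):

# Toy sketch (assumption): relative frequency of function-word patterns
# as simple style markers for meta knowledge.
text = open("speech_obama.txt").read().lower()
n_words = float(len(text.split()))
for phrase in ("of the", "on the", "has been"):
    print phrase, text.count(phrase) / n_words
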
tf-idf: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
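A minimal worked sketch of tf-idf as defined on the Wikipedia page above, assuming raw counts for tf and idf = log(N / df), one common variant:

import math

docs = [d.split() for d in ("the cat sat", "the dog sat", "the dog barked")]
N = len(docs)  # number of documents

def tfidf(term, doc):
    tf = doc.count(term)                      # raw term frequency in the document
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    return tf * math.log(N / float(df))       # idf = log(N / df)

print tfidf("cat", docs[0])   # rare term: weight log(3) ~ 1.1
print tfidf("the", docs[0])   # occurs in every document: idf = log(1) = 0

A term that occurs in every document is weighted down to zero; a term unique to one document gets the full idf boost.
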
Schwartz HA, Eichstaedt JC, et al. (2013), "Personality, Gender, and Language in the Age of Social Media" (controversial): http://www.ppc.sas.upenn.edu/socialmediapub.pdf
- women use more pronouns; men use more determiners and quantifiers
- relational language: women
- informative language: men

Lexical/morphological and syntactic reasons for bad translations:
- ambiguity; examples: "snappy little girl's school", adding "only" at different positions in a sentence, "all students know two languages" (the same two languages?)
- paraphrase, inference
- example: "The mayors prohibited the students from demonstrating because they preached the revolution" (who is "they"?)

'''Language processing pipeline: text input to meaning output (deep understanding)''' (a small code sketch follows the list)

text input ->
- tokenization / normalization
- lemmatization (reducing words to their dictionary form)
- part-of-speech tagging: determining the word class of each token
- shallow parsing (who is doing what to whom)
- modality / negation detection
- word sense disambiguation (e.g. "bank")
- semantic role labelling
- named-entity recognition
- co-reference resolution
-> meaning output
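A few of these steps can be tried out with pattern, the library introduced further below; a minimal sketch, assuming pattern 2.6, showing tokenization, part-of-speech tags, lemmas and shallow parsing:

from pattern.en import parsetree

for sentence in parsetree("The students demonstrated in Antwerp.", relations=True, lemmata=True):
    for word in sentence.words:
        print word.string, word.type, word.lemma                        # token, POS tag, lemma
    for chunk in sentence.chunks:
        print chunk.type, chunk.role, [w.string for w in chunk.words]   # shallow parse

Deeper steps such as word sense disambiguation and co-reference resolution are not covered by pattern and need dedicated tools.
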
'''Text mining (shallow understanding)'''
- contents: extract facts (concepts and relations between concepts) and opinions
- meta-data

Text mining (Marti Hearst 2003)
Don Swanson 1981: medical hypothesis generation

Example: deception
- not everyone is equally successful in deceiving others
- liars use fewer exclusive words, fewer self- and other-references, fewer time-related words, fewer tentative words; more space-related words, more motion verbs, more negations, more negative and positive emotion words

Cornell University study (Ott et al. 2011):
- human judges fail to make the distinction (truth bias): low inter-annotator agreement (see the kappa sketch below), 2 out of 3 judges perform at chance level
- the classifier succeeds (90% accurate); cues: more superlatives, and deceptive texts use imaginative rather than informative language
- however, there is more than one difference between the text sources, so the classifier may pick up on other factors than deception
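An aside on "low inter-annotator agreement": a common way to quantify it is Cohen's kappa, which corrects raw agreement for chance; a minimal sketch with made-up labels:

def cohens_kappa(a, b):
    # a, b: label sequences from two annotators over the same items
    n = float(len(a))
    labels = set(a) | set(b)
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n                # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

print cohens_kappa(["truthful", "deceptive", "deceptive"],
                   ["truthful", "truthful", "deceptive"])           # 0.4
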
Text categorisation: documents -> classes
documents -> linguistically analysed data -> text classifier (see CODE 2 below)
- looking at what defines an author, not at what he writes about
- Hans Moravec diagram

'''Sentiment mining'''
- sentiment lexicon, classifiers and annotation
- politiekebarometer.be
- www.clips.uantwerpen.be/cqrellations
- tools: SAS, R, Python
- http://www.clips.ua.ac.be/pages/pattern

from pattern.web import Twitter
from pattern.en import sentiment

for tweet in Twitter(language="en").search("#obama"):
    print tweet.text
    print sentiment(tweet.text)  # (polarity, subjectivity) of the tweet; the import was unused in the notes

csv files; the Datasheet functionality of pattern (pattern.db)

'''Classifiers'''
- examples: predict the sentiment polarity of a text, predict the position of a face in an image
- training documents, classes, bag-of-words representation
- features: word bigrams, character trigrams; example: the misspelling "exellent" still yields the trigrams exe, xel, ell, lle, len, ent (see the sketch below)
- word lemmas
- tokenization: Goed! = goed + !
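A quick sketch of the character-trigram idea (the helper function is hypothetical, not pattern API): a misspelled word still shares most of its trigrams with the correct spelling, so the feature degrades gracefully:

def char_ngrams(word, n=3):
    # all overlapping substrings of length n
    return [word[i:i+n] for i in range(len(word) - n + 1)]

print char_ngrams("exellent")
# ['exe', 'xel', 'ell', 'lle', 'len', 'ent']
print set(char_ngrams("exellent")) & set(char_ngrams("excellent"))
# shared trigrams: ell, lle, len, ent
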
Annotation = gold standard
- online learning option
- annotation process
- concept clusters

'''Deep Learning'''
- based on neural networks
- encodes world knowledge into our vocabulary
- queen = king - man + woman (see the sketch below)
- word2vec (not deep learning itself, but the same principle, with a similar increase in performance)
- applications: language technology, speech technology, image recognition, recommender systems
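The queen = king - man + woman analogy can be reproduced with pretrained word vectors; a sketch assuming the gensim library and a downloaded word2vec model (the file name is a placeholder):

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
# vector arithmetic: king - man + woman -> nearest word
print model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
# expected: [('queen', <cosine similarity>)]
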
'''RUBEN'S CODE'''

CODE 1:
# build a small corpus: extract the text from two PDFs and save (text, label) rows as csv
from pattern.web import PDF
from pattern.db import Datasheet

ds = Datasheet()

f = open('Bible.pdf')
pdf = PDF(f)
ds.append((pdf.string, 'Bible'))

f = open('quran.pdf')
pdf = PDF(f)
ds.append((pdf.string, 'Quran'))

ds.save('bible_quran.csv')

print 'saved!'

CODE 2:

# train an SVM classifier on whole books, then classify a set of speeches
from pattern.web import URL, plaintext
from pattern.vector import Document, NB, KNN, SLP, SVM, POLYNOMIAL
from pattern.db import csv
from pattern.en import parse
import math

# classifier = SVM(kernel=POLYNOMIAL, degree=10)
classifier = SVM()

print 'TRAINING:'
for text, book in csv('bible_quran_torah.csv'):
    length = len(text)
    # earlier experiment (commented out): train on ten slices of each book instead of the whole text
    # part_len = int(math.floor(length / 10))
    # for i in xrange(1, 10):
    #     s = text[i * part_len : i * part_len + part_len]
    #     v = Document(parse(s, tokenize=True, lemmata=True, tags=False, relations=False, chunks=False), type=book, stopwords=True)
    #     classifier.train(v)
    v = Document(parse(text, tokenize=True, lemmata=True, tags=False, relations=False, chunks=False), type=book, stopwords=True)
    classifier.train(v)

print 'CLASSES:', classifier.classes
print 'RESULTS\n======'

return_discrete = True
# classify each text and print the predicted class (the repeated blocks are folded into one loop)
for label, path in (
        ("OBAMA", "speech_obama.txt"),
        ("OSAMA", "speech_osama.txt"),
        ("MALCOLM X", "speech_malcolmx"),
        ("ANITA", "essay_anita.txt"),
        ("POPE", "speech_pope.txt"),
        ("NETANYAHU", "speech_netanyahu.txt"),
        ("LUTHER KING", "speech_luther-king.txt"),
        ("CQRRELATIONS", "cqrrelations.txt")):
    print label
    s = open(path).read().replace('\n', '')
    s = parse(plaintext(s), tokenize=True, lemmata=True, tags=False, relations=False, chunks=False)
    print classifier.classify(Document(s), discrete=return_discrete)