User:Cristinac/Day2: Difference between revisions

From XPUB & Lens-Based wiki
'''CLiPS'''
Third Day


'''Computational Linguistics & Psycholinguistics'''
Critique on the annotators of sentiment analysis/ pedophilia classifier
subjectivity is at the root of things, however it seems to be taken as matter-of-fact
marketing value
making things visible
there is no date of the assessment, no way of finding out more details about the annotation process.
not a process of annotation, but a process of evaluation
http://cs229.stanford.edu/proj2013/ReesmanMcCann-Vehicle%20Detection.pdf
http://cs.stanford.edu/people/karpathy/deepimagesent/
http://cs.stanford.edu/people/karpathy/deepimagesent/devisagen_arxiv.pdf
http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_JiXYY10.pdf
feedback loops: analysis while typing
Otlet’s text; progressive at the time it came out, colonialist, racist(??-writing from memory) now
1984-a book being referenced a lot without previous knowledge of it, how is it influencing the current discourse


'''Guy de Pauw, Walter Daelemans, Tom De Smedt'''
http://eipcp.net/transversal/0106/holmes/en
http://rybn.org/
http://www.antidatamining.net/
http://bureaudetudes.org/
http://theyrule.net/
http://theyrule.net/drupal/topics/visualization
http://www.nanex.net/
https://en.wikipedia.org/wiki/Louis_Bachelier
http://littlesis.org/
http://www.wdgann.com/about-us
http://www.wallstreetandtech.com/trading-technology/after-the-hash-crash-worrying-about-the-next-glitch/a/d-id/1268082
https://en.wikipedia.org/wiki/Web_Bot
http://predict-market.biz/
http://asuperstitiousfund.com/
http://robinhoodcoop.org/
http://www.corp-lab.com/tradewar/




W D Gann




-Chomsky’s work on linguistics: language acquisition device.
-extracting knowledge from data


Natural language processing
Words for the annotator
most information is in unstructured data (text)
data in digital form
big data (too big to handle with conventional means)


>90% of currently available data was created in the last 2 years
-mild vs inflammatory
 
-oriented vs disoriented
Problems: accuracy levels, speed, fundamental problems (form-meaning relation, semantics, world knowledge)
 
1 objective knowledge (machine reading): recognising which word types are used
2 subjective knowledge: sentiment, opinion, emotion, modality, (un)certainty. especially with the advent of social media
3 meta knowledge: authorship, author attributes (educational level,age,gender,personality,region,illness), text attributes (date of writing,..): of the, on the, has been
tf-idf (term frequency, inverse document frequency)
 
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
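A minimal sketch of the tf-idf weighting linked above, assuming the plain variant (relative term frequency times the natural log of N over document frequency). The three example documents are made up for illustration; real toolkits often use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up toy corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
```

A word that occurs in every document ("the") gets weight 0, while a word unique to one document ("mat") gets the highest weight in that document.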
 
Schwartz HA Eichstaedt JC 2013 Personality, Gender, and Language in the age of Social Media - controversial
http://www.ppc.sas.upenn.edu/socialmediapub.pdf
 
women use more pronouns, men use more determiners and quantifiers
-relational language - women
-informative language - men
 
Lexical / morphological, syntactic reasons for bad translations
ambiguity. example: snappy little girl’s school, adding only in a sentence, all students know two languages (the same languages?)
paraphrase. inference.
ex. The mayors prohibited the students to demonstrate because they preached the revolution (who is they?)
 
Language processing pipeline: text input to meaning output
Deep Understanding
text input —>
tokenization/normalization
lemmatization (reducing words to their dictionary form)
part-of-speech tagging: determining the grammatical type of each word (noun, verb, ...)
shallow parsing (who is doing what to whom)
modality/negation
word sense disambiguation (ex bank)
semantic role labelling
named-entity recognition
co-reference resolution
—>meaning output
 
Text mining (shallow understanding)
contents: extract facts (concepts and relations between concepts) and opinions
meta-data
 
 
Text mining (Marti Hearst 2003)
Don Swanson 1981: medical hypothesis generation
 
example: deception
 
not everyone is equally successful in deceiving others
 
Liars use fewer exclusive words, fewer self and other references, fewer time related words, fewer tentative words, more space related words, more motion verbs, more negations, more negative and positive emotions
 
Cornell University study (Ott et al 2011)
 
Human judges fail to make the distinction (truth bias): inter-annotator agreement is low and 2 out of 3 judges perform at chance level, while a classifier succeeds (90% accurate). Cues for deceptive text: more superlatives, imaginative rather than informative language.
However, the text sources differ in more than one respect, so the classifier may be picking up on something other than deception.
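Inter-annotator agreement, mentioned above, is commonly measured with Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The notes do not say which measure the study used, so this is only a sketch of one common choice, and the labels below are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up judgements of four reviews by two human judges.
judge_1 = ["truthful", "deceptive", "truthful", "truthful"]
judge_2 = ["truthful", "truthful", "deceptive", "truthful"]
kappa = cohen_kappa(judge_1, judge_2)
```

Here the judges agree on half of the items, but because both say "truthful" most of the time (a truth bias), chance agreement is even higher and kappa comes out negative.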
 
Text categorisation:
documents —> classes —> text classifier
documents —> linguistically analysed data —> text classifier
 
looking at what defines an author, and not what he writes about
Hans Moravec diagram
 
 
Sentiment mining
sentiment lexicon, classifiers and annotation
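A minimal sketch of the sentiment-lexicon approach: every known word carries a polarity score and the text score is their average. The lexicon entries, their values and the simple negation rule are assumptions made up for this example; this is not how any particular toolkit implements it.

```python
import re

# Hypothetical toy lexicon: word -> polarity between -1 and 1.
LEXICON = {"good": 0.7, "excellent": 1.0, "bad": -0.7, "terrible": -1.0}
NEGATIONS = {"not", "no", "never"}


def polarity(text):
    """Average polarity of the lexicon words found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        if token in LEXICON:
            value = LEXICON[token]
            # A negation flips only the word that directly follows it.
            scores.append(-value if negate else value)
        negate = False
    return sum(scores) / len(scores) if scores else 0.0
```

With this toy lexicon, polarity("not good") is negative and polarity("An excellent talk") is positive. The scores are only as good as the annotators who filled in the lexicon, which is where the subjectivity enters.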
 
politiekebarometer.be
www.clips.uantwerpen.be/cqrellations
 
 
SAS, R, Python
http://www.clips.ua.ac.be/pages/pattern
 
 
from pattern.web import Twitter
from pattern.en import sentiment
 
for tweet in Twitter(language="en").search("#obama"):
    print tweet.text
csv files
data sheet functionality of pattern
 
 
Classifiers:
ex: predict the sentiment polarity, predict the position of a face
 
 
training document, class event, bag-of-words
word bigrams,
character trigrams. example: wrong spelling of "exellent": exe, xel, ell, lle, len, ent
word lemmas
tokenization: Goed!=goed+!
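The feature types listed above can be sketched with the standard library alone. The function names are made up for this example, and the regular expression is a deliberately crude tokenizer, not the one any toolkit uses.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def word_ngrams(tokens, n):
    """Sequences of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, n):
    """Sequences of n consecutive characters."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


tokens = tokenize("Goed!")                                 # ['goed', '!']
bag = Counter(tokenize("a good talk, a good day"))         # bag-of-words counts
bigrams = word_ngrams(tokenize("a good talk"), 2)          # ['a good', 'good talk']
trigrams = char_ngrams("exellent", 3)                      # ['exe', 'xel', 'ell', 'lle', 'len', 'ent']
```

Character trigrams are robust to spelling errors: "exellent" and "excellent" still share most of their trigrams, whereas as whole words they would count as two unrelated features.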
 
 
Annotation=gold standard
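The annotated examples (the gold standard) are what a classifier learns from. Below is a minimal sketch of a bag-of-words Naive Bayes classifier with add-one smoothing, written from scratch for illustration; the class name and the four training examples are made up. Because train() takes one example at a time, new annotations can be added while the classifier is already in use.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bags of words (illustrative sketch)."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocabulary = set()

    def train(self, tokens, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # Log prior plus log likelihood of each token.
            score = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token] + 1  # add-one smoothing
                score += math.log(count / (n_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up gold standard: (tokens, label) pairs.
gold = [
    ("excellent talk".split(), "positive"),
    ("really good day".split(), "positive"),
    ("terrible bad talk".split(), "negative"),
    ("bad day".split(), "negative"),
]
nb = NaiveBayes()
for tokens, label in gold:
    nb.train(tokens, label)
```

The classifier can only reproduce the judgements in its training data: whatever the annotators decided counts as "positive" becomes the definition of positive.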
 
online learning option
 
annotation process
 
 
 
 
 
concept clusters
 
 
 
Deep Learning
Based on neural networks
encode world knowledge into our vocabulary
queen=king-man+woman
 
word2vec (not deep learning, same principle, similar additional increase) - application
applications: language technology, speech technology, image recognition, recommender systems
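The analogy queen = king - man + woman can be illustrated with vector arithmetic. The two-dimensional vectors below are invented by hand for this sketch (roughly "royalty" and "maleness"); real word vectors are learned from large corpora and have hundreds of dimensions.

```python
import math

# Hypothetical hand-made vectors: (royalty, maleness).
VECTORS = {
    "king": (0.9, 0.9),
    "queen": (0.9, 0.1),
    "man": (0.1, 0.9),
    "woman": (0.1, 0.1),
    "apple": (0.0, 0.5),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to a - b + c."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    # Exclude the three input words from the candidates.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman", VECTORS)  # 'queen'
```

With learned vectors the same arithmetic also reproduces whatever associations and biases are present in the training text.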
 
 
 
 
 
RUBENS CODE
 
CODE 1:
from pattern.web import PDF
from pattern.db import Datasheet

# Collect the full text of each PDF together with its label.
ds = Datasheet()

# PDFs are binary files, so open them in 'rb' mode.
f = open('Bible.pdf', 'rb')
pdf = PDF(f)
ds.append((pdf.string, 'Bible'))
f.close()

f = open('quran.pdf', 'rb')
pdf = PDF(f)
ds.append((pdf.string, 'Quran'))
f.close()

# One row per book: (text, label).
ds.save('bible_quran.csv')

print 'saved!'
 
CODE 2:
 
from pattern.web import URL, plaintext
 
from pattern.vector import Document, NB, KNN, SLP, SVM, POLYNOMIAL
 
from pattern.db import csv
 
from pattern.en import parse
 
import math
 
# classifier = SVM(kernel=POLYNOMIAL, degree=10)
 
classifier = SVM()
 
print 'TRAINING:'
 
for text, book in csv('bible_quran_torah.csv'):
        length = len(text)
        # part_len = int(math.floor(length/10))
        # print book
        # # print part_len
        # for i in xrange(1,10):
        #        print i
        #        s = text[i*part_len : i*part_len + part_len]
        #        v = Document(parse(s, tokenize=True, lemmata=True, tags=False, relations=False, chunks=False), type=book, stopwords=True)
        #        classifier.train(v)
       
        # Train on the whole book as one document, labelled with the book name.
        # Note: the keyword is lemmata; a misspelled keyword is silently ignored.
        v = Document(parse(text, tokenize=True, lemmata=True, tags=False, relations=False, chunks=False), type=book, stopwords=True)
        classifier.train(v)
 
print 'CLASSES:',classifier.classes
 
print 'RESULTS\n======'
 
return_discrete = True
 
# (label, file name) pairs for the texts to classify.
texts = [
        ("OBAMA", "speech_obama.txt"),
        ("OSAMA", "speech_osama.txt"),
        ("MALCOLM X", "speech_malcolmx"),
        ("ANITA", "essay_anita.txt"),
        ("POPE", "speech_pope.txt"),
        ("NETANYAHU", "speech_netanyahu.txt"),
        ("LUTHER KING", "speech_luther-king.txt"),
        ("CQRRELATIONS", "cqrrelations.txt"),
]

for label, filename in texts:
        print label
        # Replace newlines with spaces so words at line ends are not glued together.
        s = open(filename).read().replace('\n', ' ')
        s = parse(plaintext(s), tokenize=True, lemmata=True, tags=False, relations=False, chunks=False)
        print classifier.classify(Document(s), discrete=return_discrete)

Revision as of 11:46, 23 January 2015
