User:Cristinac/Day2

From XPUB & Lens-Based wiki
Day four
CLiPS


Deconstructing Harry
Computational Linguistics & Psycholinguistics
Guttorm Guttormsgaard
Asger Jorn


Guy de Pauw, Walter Daelemans, Tom De Smedt


metadata of information; data gallery: average colour of each image, timestamp, face recognition data; what could a photo gallery mean?
-Chomsky’s work on linguistics: language acquisition device.
Gaussian blur (many image treatments begin with it)
-extracting knowledge from data
derivatives
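
A minimal blur sketch with PIL/Pillow (the filename is hypothetical; the computer-vision book linked below works along the same lines):

from PIL import Image, ImageFilter

# convert to grayscale, then smooth: blurring suppresses the noise
# that taking derivatives would otherwise amplify
im = Image.open('photo.jpg').convert('L')
im.filter(ImageFilter.GaussianBlur(radius=2)).save('photo_blurred.jpg')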


Natural language processing
most information is in unstructured data (text)
data in digital form
big data (too big to handle with conventional means)


http://programmingcomputervision.com/
>90% of currently available data was created in the last 2 years


Problems: accuracy levels, speed, fundamental problems (form-meaning relation, semantics, world knowledge)


1 objective knowledge (machine reading): recognising which word types are used
2 subjective knowledge: sentiment, opinion, emotion, modality, (un)certainty; especially relevant with the advent of social media
3 meta knowledge: authorship, author attributes (educational level, age, gender, personality, region, illness), text attributes (date of writing, ...); function-word features: of the, on the, has been
tf–idf (term frequency – inverse document frequency)


https://en.wikipedia.org/wiki/Tf%E2%80%93idf
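
A minimal sketch of the idea in plain Python (the toy corpus is invented for illustration):

import math

# toy corpus: each document is a list of words
docs = ["of the of the has been".split(),
        "the revolution has been televised".split(),
        "students know two languages".split()]

def tf_idf(term, doc, docs):
    # term frequency: how often the term occurs in this document
    tf = doc.count(term) / float(len(doc))
    # inverse document frequency: terms found in few documents score higher
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / float(df))

print tf_idf("the", docs[0], docs)        # frequent across the corpus -> low score
print tf_idf("languages", docs[2], docs)  # rare in the corpus -> higher score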


a gradient has a magnitude and a direction (like a vector)
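
In code, the gradient gives one such vector per pixel; a small numpy sketch (numpy is an assumption, not part of the session):

import numpy as np

im = np.random.rand(64, 64)     # stand-in for a grayscale image
gy, gx = np.gradient(im)        # partial derivatives along y and x
magnitude = np.hypot(gx, gy)    # gradient strength per pixel
direction = np.arctan2(gy, gx)  # gradient angle per pixel, in radians
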
Schwartz, H.A., Eichstaedt, J.C., et al. (2013), “Personality, Gender, and Language in the Age of Social Media” (controversial)
http://www.ppc.sas.upenn.edu/socialmediapub.pdf


Search By Image - Sebastian Schmieg
women use more pronouns, men use more determiners and quantifiers
-relational language - women
-informative language - men


Contour detection
Lexical / morphological, syntactic reasons for bad translations
detected contours are not continuous but fragments; an extra step determines which fragments belong together
ambiguity. examples: “snappy little girl’s school”; adding “only” at different places in a sentence; “all students know two languages” (the same two languages?)
paraphrase. inference.
ex. The mayors prohibited the students from demonstrating because they preached the revolution (who is “they”?)


Volterra Kernel Training/Identification System
Language processing pipeline: text input to meaning output
Deep Understanding
text input —>
tokenization/normalization
lemmatization (reducing words to their dictionary form)
part-of-speech tagging: determining the word class of each token
shallow parsing (who is doing what to whom)
modality/negation
word sense disambiguation (ex bank)
semantic role labelling
named-entity recognition
co-reference resolution
—>meaning output
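
In Pattern (used later on this page) most of these steps are bundled into a single call; a minimal sketch:

from pattern.en import parse

# tokenization, part-of-speech tags, shallow parsing (chunks),
# relations and lemmas in one pass; the result is a tagged string
print parse("The students demonstrated against the mayors.",
            tokenize=True, tags=True, chunks=True, relations=True, lemmata=True)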


Text mining (shallow understanding)
contents: extract facts (concepts and relations between concepts) and opinions
meta-data


http://www.sciencedirect.com/science/article/pii/S0952197612002461
statistical, not logical model of the face
behind the algorithm is manual work that is done by people, repeatedly
every detail of the face is annotated
labour conditions


Training data to feed the classifier: no image exists in isolation
Text mining (Marti Hearst 2003)
False positives: images that have been selected as containing a face when they don't contain one
Don Swanson 1981: medical hypothesis generation


example: deception


https://en.wikipedia.org/wiki/Ghostwriter
not everyone is equally successful in deceiving others


Liars use fewer exclusive words, fewer self- and other-references, fewer time-related words, fewer tentative words; more space-related words, more motion verbs, more negations, more negative and positive emotion words


cvdazzle.com - techniques to avoid face detection
Cornell University study (Ott et al 2011)


the same algorithm can be fed with any kind of statistical data; ex: banana recognition
Human judges fail to make the distinction (truth bias): low inter-annotator agreement, 2 out of 3 perform at chance level; the classifier succeeds (90% accurate). Cues: more superlatives; deceptive texts use imaginative rather than informative language
sort by face
however, there is more than one differentiation between the text sources


http://www.cise.ufl.edu/~arunava/papers/cvpr09.pdf
Text categorisation: documents —> classes
documents —> text classifier
documents —> linguistically analysed data —> text classifier


https://en.wikipedia.org/wiki/Volterra_series
looking at what defines an author, and not what they write about
Hans Moravec diagram




Sentiment mining
sentiment lexicon, classifiers and annotation
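
Pattern's lexicon-based scorer is one example; a minimal sketch:

from pattern.en import sentiment

# returns (polarity, subjectivity): polarity in [-1.0, +1.0],
# subjectivity in [0.0, 1.0], averaged over known sentiment words
print sentiment("The organisers were wonderful, the coffee was terrible.")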


CSV: no space for metadata, no authorship information
politiekebarometer.be
https://okfn.org/
www.clips.uantwerpen.be/cqrellations


frictionless data
http://centraldedados.pt/


SAS, R, Python
http://www.clips.ua.ac.be/pages/pattern




adding “I think” at the end of every paragraph

from pattern.web import Twitter
from pattern.en import sentiment


IPython

# fetch English tweets matching a query and print their text
for tweet in Twitter(language="en").search("#obama"):
    print tweet.text

csv files
the Datasheet functionality of Pattern
 
 
Classifiers:
ex: predict the sentiment polarity, predict the position of a face
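
A toy classifier along those lines with Pattern's vector module (the training texts are invented):

from pattern.vector import Document, NB

# train a Naive Bayes classifier on two hand-labelled documents
nb = NB()
nb.train(Document("a sweet delicious cake with cherries", type="positive"))
nb.train(Document("a bland, disappointing and overpriced meal", type="negative"))

# predicted class of an unseen text
print nb.classify(Document("a delicious meal with sweet wine"))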
 
 
training documents, class labels, bag-of-words
word bigrams,
character trigrams. example: wrong spelling of “exellent”: exe, xel, ell, len, ent
word lemmas
tokenization: Goed! = goed + !
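
Why character trigrams survive misspellings, as a short plain-Python sketch (the helper is hypothetical):

def char_trigrams(word):
    # all overlapping three-character substrings of the word
    return [word[i:i+3] for i in range(len(word) - 2)]

print char_trigrams("exellent")   # ['exe', 'xel', 'ell', 'lle', 'len', 'ent']
print char_trigrams("excellent")  # shares 'ell', 'lle', 'len', 'ent' with the misspelling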
 
 
Annotation = gold standard
 
online learning option
 
annotation process
 
 
 
 
 
concept clusters
 
 
 
Deep Learning
Based on neural networks
encode world knowledge into our vocabulary
queen = king - man + woman
 
word2vec (not deep learning itself, but the same principle, with a similar increase in performance)
applications: language technology, speech technology, image recognition, recommender systems
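
A sketch of the analogy with the gensim library (an assumption: gensim was not shown in the session, and a useful model needs a large corpus rather than this toy one; gensim >= 4 exposes the call as model.wv.most_similar):

from gensim.models import Word2Vec

# train embeddings on a (far too small) toy corpus of tokenized sentences
sentences = [["the", "king", "rules", "the", "land"],
             ["the", "queen", "rules", "the", "land"],
             ["the", "man", "walks"],
             ["the", "woman", "walks"]]
model = Word2Vec(sentences, min_count=1)

# queen = king - man + woman, as arithmetic over word vectors
print model.most_similar(positive=["king", "woman"], negative=["man"])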
 
 
 
 
 
RUBEN'S CODE
 
CODE 1:
from pattern.web import PDF
from pattern.db import Datasheet

# a Datasheet is Pattern's table object; rows can be saved as CSV
ds = Datasheet()

# extract the plain text of each PDF and append it as a (text, label) row
f = open('Bible.pdf')
pdf = PDF(f)
ds.append((pdf.string, 'Bible'))

f = open('quran.pdf')
pdf = PDF(f)
ds.append((pdf.string, 'Quran'))

# this CSV is the training data read back in by CODE 2
ds.save('bible_quran.csv')
 
 
print 'saved!'
 
CODE 2:
from pattern.web import plaintext
from pattern.vector import Document, NB, KNN, SLP, SVM, POLYNOMIAL
from pattern.db import csv
from pattern.en import parse
import math

# a support vector machine; NB, KNN and SLP are alternative classifiers
# classifier = SVM(kernel=POLYNOMIAL, degree=10)
classifier = SVM()

print 'TRAINING:'
for text, book in csv('bible_quran_torah.csv'):
    # optional variant: split each book into 10 parts and train per part
    # part_len = int(math.floor(len(text) / 10))
    # for i in xrange(1, 10):
    #     s = text[i*part_len : i*part_len + part_len]
    #     v = Document(parse(s, tokenize=True, lemmata=True, tags=False, relations=False, chunks=False), type=book, stopwords=True)
    #     classifier.train(v)

    # lemmatize the full text; stopwords=True keeps function words,
    # which carry the stylistic signal for authorship
    v = Document(parse(text, tokenize=True, lemmata=True, tags=False, relations=False, chunks=False), type=book, stopwords=True)
    classifier.train(v)
 
print 'CLASSES:', classifier.classes

print 'RESULTS\n======'

return_discrete = True

# classify each test text: read it, strip markup, lemmatize, print the prediction
for label, filename in (
        ("OBAMA", "speech_obama.txt"),
        ("OSAMA", "speech_osama.txt"),
        ("MALCOLM X", "speech_malcolmx"),
        ("ANITA", "essay_anita.txt"),
        ("POPE", "speech_pope.txt"),
        ("NETANYAHU", "speech_netanyahu.txt"),
        ("LUTHER KING", "speech_luther-king.txt"),
        ("CQRRELATIONS", "cqrrelations.txt")):
    print label
    # join lines with a space so words at line breaks are not glued together
    s = open(filename).read().replace('\n', ' ')
    s = parse(plaintext(s), tokenize=True, lemmata=True, tags=False, relations=False, chunks=False)
    print classifier.classify(Document(s), discrete=return_discrete)
