User:Fako Berkers/project2: Difference between revisions

From XPUB & Lens-Based wiki
 
==Sniff, Scrape, Crawl==


===Wicked Wiki===
I started the year with a scraper/tool that investigates people on Wikipedia. At the moment it can find colleagues based on similarities in categorization. This is a useful tool, because if you find one person interesting you are likely to appreciate the others. Investigating Gandhi, for instance, delivers some little-known Indian rulers and peace activists. I'm still working on improvements. You can read more [[wickedwiki|here]].


===Anecdote: from WW to AA===
When I was on my way to Berlin to visit Transmediale I was thinking about my Wicked Wiki idea. The program was slow but functional, and I was looking for a next step. For me this was visualizing the results. Something that concerned me was that a lot of people on Wikipedia disappear into the unknown. That's why I wanted to make superstars out of people who are on Wikipedia but not famous.

When I got back from Berlin, the results for my test subject (Slavoj Zizek) were polluted with "German Nachste Superstar", the German Idol programme on television. Quite baffled by this coincidence, I started to work on a program that visualized the battle between Zizek and Lombardi (one of the German Idol participants). This grew into the much bigger tool that is now Attention Arena.


===Attention Arena===
I'm interested in how the internet regulates our attention. Some subjects may be very interesting, but because the majority doesn't watch them they fade away on the internet, replaced by what others are watching. Attention for some interesting subjects disappears because of this.

I started to visualize this by putting two YouTube videos on top of each other, fading the less popular video. This worked well with videos from Zizek and Lombardi, a philosopher and a teenage superstar respectively.

Spurred on by this nice result I started to work on a tool that does the following:
* It allows a user to create a list of YouTube videos in the Django administration.
* These videos are downloaded automatically, and converted for manipulation purposes, when you run a script.
* The videos can be shown in openFrameworks.
* With Python you can save viewing preferences in Django, like size, position, transparency and number of videos.

My plan is to create a pyramid of six videos. The top video will be the most popular one and the lower ones less popular. The lower a video is in the pyramid, the more faded it appears. This setup shows what is popular and what is not. It reveals where the attention of the public is heading and what will slowly be forgotten.

Writing an algorithm that creates this pyramid will not be too hard. The challenge is to find a good collection of YouTube videos that strengthens the concept behind the form. This is where the work is at now.


===How Bieber is Your Hero?===
For the open day I created a website where you can search for your hero. I was inspired by this comment on YouTube:

''NetaJi was inspired with Swami Vivekananda , who do we make our role models?...Actors or Sports stars...and its not our fault , the actors at least pretend to be heroes , and sports stars bring some pride( however small) to the country, we choose them because there are no real Heroes left , Look what the politicians have made of us , Netaji was considered a terrorist up till the 70's by the Cong. Govt. while Rajiv & Indira are considered Great DeshBhaktas.''

This is very vague for a non-Indian person, but what I take from it is that our heroes consist only of artists and sports people. This is exactly what I found when looking at Category:Living_people on Wikipedia. The best documented people were part of popular culture, either sports or singing. Yet they don't really inspire me, and they are part of a culture that you will get to know whether you want to or not.

The website is a place where you can spend your time with your real heroes (from popular culture or not). Through the Wicked Wiki software it finds colleagues of your hero. This is exciting, because you may never have heard of these people! At the same time these heroes are put into the perspective of popular culture by comparing their popularity with that of the most popular artist on YouTube at the moment. On the website every hero is a certain percentage of Bieber, calculated by relating the YouTube views of Bieber and of your hero. Bieber has more than 500 million YouTube views, and because of this most heroes don't make a 1% Bieber rating.

The "How Bieber is Your Hero" website is a tool that can help you diminish information overload. At the same time it makes you aware of the relation between your references and the icons of popular culture.


===WikiAPI===
I have had a look at Wikipedia and I'm interested in categories, especially when they include people. There is, for instance, a category of Marxist theorists ([http://en.wikipedia.org/wiki/Category:Marxist_theorists to stay a little in the same genre as last trimester]). This page lists all people categorized as Marxist theorists and nothing else.

I find categories exciting whenever '''I regard them as communities'''. The persons listed there may not even be aware of this community, but in fact some common ideal or subject binds these persons together.

I would like to sniff, scrape and crawl in a number of ways to reveal these communities to themselves and others. The following possibilities occurred to me when viewing the Wiki API:
* try to fetch jargon used by a community (or their wiki users/pages)
* try different kinds of mapping (most quoted, highest rank by Google, most backlinks, voted most important by their own community, voted most important by critics)
* fetch the total bibliography of a community and devise sorting algorithms
* create a "field view" by relating the communities of critics to the community being portrayed
* try a community kickstart by putting email addresses associated with the names on a mailing list

In the long run small apps like these might build up to '''article validation'''. For instance, if a text called text.A contains jargon from community.13, then a computer could see to whom the described ideas belong and how these are regarded by other communities (critique) and by the rest of the world (popularity measured through Google ranking).

Article validation may be useful to counter information overload, but I do think that users should always be able to favor certain writers manually. This is to make sure that people choose to ignore or favor certain writing, instead of a computer telling people what to read because most people read it.


===Wiki Interface===
There is the ApiInterface and the ScrapeInterface. You create an object of one of these classes and call object.parse(pagetitle) to get the data.

The '''ScrapeInterface''' will take all CDATA from tags of your choice. You can pass the names of the tags (p, h1) as a list to the object initiator. The method parse(pagetitle) will return a list with all fetched CDATA.

For the '''ApiInterface''' you have to choose a "generator" at creation of the ApiInterface object. The method parse(pagetitle) will return a list of page dictionaries depending on the generator. An object created with "links" will return all info about the pages that are linked on the requested pagetitle.

POSSIBLE GENERATORS
* links (pl)
* images (im)
* templates (tl)
* categories (cl)
* backlinks (bl)
* categorymembers (cm) (call with Category:SomeCategory only)

IMPLEMENTABLE GENERATORS (use "from" instead of titles+continue)
* allimages (ai)
* allpages (ap)
* allcategories (ac)

<source lang="python">
#!/usr/bin/env python

# Import necessary modules
import urllib2
import json
import HTMLParser # SAX-like HTML parser. NOT PYTHON 3.0 COMPATIBLE: PROBLEMS WITH WORD SCRAPING!!
import time


# Will scrape Wikipedia and fetch data (which depends on constructor arguments) into self.collection
class ApiInterface:

    def __init__(self, generator, urlpostfix =""):
        self.url = u'http://en.wikipedia.org/w/api.php?action=query&format=json&redirects=true&prop=info&generator='+generator.encode('utf-8')
        self.urlpostfix = urlpostfix
        self.maxlimit = False
        self.collection = []
        if generator == "backlinks":
            self.pagekey = u"&gbltitle="
        elif generator == "categorymembers":
            self.pagekey = u"&gcmtitle="
        else:
            self.pagekey = u"&titles="
   
    def parse(self,  qpage):
        workurl = self.url+self.pagekey+urlescape(qpage)+self.urlpostfix.encode("utf-8")
        self.collection = []
       
        # DJANGO CACHE CODE HERE
        # Check if workurl is in database.
        # If so ... return data from cache
       
        while workurl:
       
            print "Next up: " + workurl
       
            # Formulate request
            request = urllib2.Request(workurl)
            user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
            request.add_header("User-Agent", user_agent)
            # Make request (try 10 times)
            for i in range(10):
                try:
                    response = urllib2.urlopen(request)
                    break
                except urllib2.URLError:
                    print
                    time.sleep(3)
                    continue
            else:
                raise urllib2.URLError("URLError timeout, are you connected??")
            # Do the parsing
            data = json.load(response)
           
            # Harvest data
            if data.has_key("query"):
                for page in data["query"]["pages"].itervalues():
                        self.collection.append(page)
           
            # DJANGO CACHE CODE HERE
            # Save collection in database under workurl.
           
            # Continue with parsing if necessary
            workurl = False
            if data.has_key("query-continue"):
                for content in data["query-continue"].itervalues():
                    for key,  value in content.iteritems():
                        if not self.maxlimit: self.setMaxLimit(key)
                        workurl = self.url+self.pagekey+urlescape(qpage)+u'&'+key.encode("utf-8")+u'='+urlescape(value.encode("utf-8"))+self.urlpostfix.encode("utf-8")
                       
        return self.collection
   
    # SHOULD BE PRIVATE
    def setMaxLimit(self,  str):
        pre = str[:-len('continue')]
        self.url = self.url + u'&' + pre.encode("utf-8") + u'limit=500'
        self.maxlimit = True
 
 
# Made to scrape a Wikipage and save all <p> content in self.collection
class ScrapeInterface(HTMLParser.HTMLParser):
   
    def __init__(self, tags = [u'h1',u'td', u'span', u'a', u'p', u'li', u'h2', u'h3', u'h4', u'h5', u'h6']):
        HTMLParser.HTMLParser.__init__(self)
        self.url = u"http://en.wikipedia.org/wiki/"
        self.tags = tags
        self.collection = []
        self.record = False
       
   
    def parse(self,  qpage,  cache=False):
        # Prepare scrape
        workurl = self.url+urlescape(qpage)
        self.collection = []
        print "URL scrape: " + workurl
       
        # DJANGO CACHE CODE HERE
        # Check if workurl is in database and has matching self.tags if cache is set to true
        # If so ... return data from cache
       
        # Formulate request
        request = urllib2.Request(workurl)
        user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
        request.add_header("User-Agent", user_agent)
        # Make request (try 10 times)
        for i in range(10):
            try:
                response = urllib2.urlopen(request)
                break
            except urllib2.URLError:
                time.sleep(3)
                continue
        else:
            raise urllib2.URLError("URLError timeout, are you connected??")
        # Do the parsing.
        htmlstring = response.read()
        self.feed(htmlstring) # this will fill self.collection with page content per tag.
       
        # DJANGO CACHE CODE HERE
        # If cache is set to true then save results in database under url and tags.
       
        return self.collection
   
    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.record = True
        else:
            self.record = False
   
    def handle_data(self,  data):
        if self.record:
            self.collection.append(data)
 
def urlescape(str):
    try:
        rsl = urllib2.quote(str)
        return rsl.encode("utf-8")
    except KeyError:
        rsl = str.replace(" ", "%20")
        return rsl.encode("utf-8")
</source>
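As a side note, the way the chosen generator shapes the request URL can be sketched independently of the classes above. This is a simplified reconstruction: the page-key logic mirrors the constructor above, and the parameter prefixes are the generator codes from the list (pl, bl, cm) prefixed with "g", as the MediaWiki API does for generator parameters.

```python
def build_query_url(generator, page, limit=500):
    """Sketch of how ApiInterface assembles its MediaWiki query URL."""
    base = ("http://en.wikipedia.org/w/api.php?action=query&format=json"
            "&redirects=true&prop=info&generator=" + generator)
    # backlinks and categorymembers take their page via a generator-specific key
    pagekeys = {"backlinks": "gbltitle", "categorymembers": "gcmtitle"}
    pagekey = pagekeys.get(generator, "titles")
    # generator parameters carry a short prefix, e.g. gcm for categorymembers
    prefixes = {"links": "gpl", "backlinks": "gbl", "categorymembers": "gcm"}
    url = base + "&" + pagekey + "=" + page.replace(" ", "%20")
    if generator in prefixes:
        url += "&" + prefixes[generator] + "limit=" + str(limit)
    return url

print(build_query_url("categorymembers", "Category:Marxist theorists"))
```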
 
===First results===
 
I've played around with the code and got some '''interesting''' results. By using a simple algorithm on the Wiki data I'm able to relate people. If you give a name to the program, it will calculate who is most likely some kind of colleague, and indeed, if you're interested in person A, the computer can guess that you will also like G and I (for example). Here's one printout:
 
<source lang="python">
Slavoj Zizek:
[(u'Slavoj \u017di\u017eek', 41), (u'Jacques Lacan', 9), (u'Antonio Negri', 8), (u'Kojin Karatani', 8), (u'Judith Butler', 7), (u'Rosa Luxemburg', 7), (u'Jacques Derrida', 7), (u'Chopper Read', 7),
(u'Bo\u017eidar Debenjak', 7), (u'Victor Menezes', 6), (u'Julia Kristeva', 6), (u'Alexander Toradze', 6), (u'Ale\u0161 Debeljak', 6), (u'Jean-Pierre Jeunet', 6), (u'Luce Irigaray', 6), (u'Boeing 727', 6), (u'Stephen Bronner', 6),
(u'Rastko Mo\u010dnik', 6), (u'Steve Brookstein', 6), (u'Alain Badiou', 6)]
</source>
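The simple algorithm behind such a ranking can be sketched as counting shared categories. This is a toy reconstruction with made-up data; the real program fetches the categories through the API and walks through far larger lists.

```python
def rank_colleagues(person, categories_of):
    """Rank all other people by how many categories they share with `person`."""
    own = categories_of[person]
    scores = {}
    for other, cats in categories_of.items():
        if other == person:
            continue
        shared = len(own & cats)
        if shared:
            scores[other] = shared
    # highest overlap first, like the printouts on this page
    return sorted(scores.items(), key=lambda kv: -kv[1])

toy = {
    "Slavoj Zizek": {"Slovenian philosophers", "Marxist theorists", "Continental philosophers"},
    "Alain Badiou": {"Marxist theorists", "Continental philosophers"},
    "Boeing 727": {"Narrow-body aircraft"},
}
print(rank_colleagues("Slavoj Zizek", toy))  # → [('Alain Badiou', 2)]
```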
 
It's interesting that the algorithm can easily predict by itself whether results will be reasonable or bad.
The algorithm could use some fine-tuning to get rid of nonsense like Boeing 727 :)
I do have ideas on how to do that, but making the calculations already takes 10 to 20 minutes, so imagine an improved version... I'm optimizing before expanding, for sure. Django could be my best friend in this.
 
Without me being aware of it, the results led to some kind of new search engine. I like the emotions I get while viewing the results. It seems like my attention is brought to interesting new people by using it.
 
===Stage two plans===
 
The most important thing now is to '''optimize'''. I assume the URL requests take the most time. The program will often make more than a thousand requests, because Category:Living_people is often fully investigated. This means it has to go through half a million names. If I saved the category listing in a sort of cache with Django, I could create the same results without overtaxing the connection.
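The caching idea marked with the "DJANGO CACHE CODE HERE" comments in the code above could look roughly like this. In this sketch a plain dictionary stands in for the Django model, and `fake_fetch` stands in for the real URL request:

```python
_cache = {}

def fetch_cached(url, fetch):
    """Return cached data for url if present, otherwise fetch and store it."""
    if url in _cache:
        return _cache[url]   # cache hit: no network request needed
    data = fetch(url)        # cache miss: one real request
    _cache[url] = data
    return data

calls = []
def fake_fetch(url):
    calls.append(url)        # record that a "network request" happened
    return {"url": url}

fetch_cached("http://example/api?titles=X", fake_fetch)
fetch_cached("http://example/api?titles=X", fake_fetch)
print(len(calls))  # prints 1: the second call was served from the cache
```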
 
====Improved ranking====
 
There are a few interesting things I can do with the code, once I have optimized it with Django, to improve the ordering of the results.
* I can set up a "control group" for each search and use that data to make commonly used categories less important than rare ones. This tool can easily be transformed to filter a vocabulary from categories (for example the rare categories), which further expands the possibilities.
* I could distinguish between related and unrelated categories, which may improve the ordering of results (pushing Boeing 727 and irrelevant people to the back), especially when dealing with less documented people.
* An alternative to improve results is relating categories and names to categories and names used on the page itself. This may delete results like Boeing 727 and some irrelevant people (like Michael Jackson in results for Albert Einstein).
All options will complement each other, and the first potentially draws the idea into another direction (searching on words instead of names). If the Django cache code is flexible enough, it might optimize these improvements as well.
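The first improvement, down-weighting commonly used categories, amounts to something like inverse frequency weighting. A sketch, where the frequency counts would come from the control group (the numbers here are illustrative):

```python
def weighted_score(shared_categories, category_frequency):
    """Score shared categories so that rare categories count more than common ones."""
    # a category shared by many people (e.g. "Living people") contributes little;
    # a rare one (e.g. "Marxist theorists") contributes a lot
    return sum(1.0 / category_frequency[cat] for cat in shared_categories)

freq = {"Living people": 500000, "Marxist theorists": 120}
common_only = weighted_score({"Living people"}, freq)
rare_only = weighted_score({"Marxist theorists"}, freq)
print(rare_only > common_only)  # prints True
```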
 
However, I'm doubting whether I want to reorder the results, because I kind of like the dirtiness (it leaves me pleasantly surprised). I would, however, like to know why something like Boeing 727 was associated with Zizek. This could be done with an improved printing procedure.
 
====Application one: community kickstarter====
 
I thought about trying to bring a community to life by harvesting email addresses or other contact details and putting the people in a category into contact with each other. This could '''“kickstart”'''  a community. Such communities could for instance work on a '''liquid publication'''. Marxists might be interested in working together on questions such as:
<code>
Can the traditional, liberal notion of ownership be retained at all in this framework [of the liquid book], and -- if we were to forgo it -- what would it mean for our established ideas of social exchange, economy, property, profit-making, and capitalism itself? Naturally, this raises a number of serious political and ethical questions that -- utopian as it may sound -- have the potential to reshape the very socio-political order in which we operate.
</code>
I'm still unsure whether to go through with this. It is easily done, but how should I respond if a few people start responding? How can I direct without interfering too much?
 
====Application two: personal persons====
 
I also thought about writing a browser add-on and linking all the traffic going through the browser to a server port. This port will then '''record all visited URLs and analyse them for the names present'''. The result will be a database that contains an image of the persons and the kind of content that I'm interested in and (more importantly) an image of people that I do not regularly read but may be interested in. '''This installation will direct my attention towards possible role models, or at least other people''', instead of mobile phone discounts. The installation might produce an '''RSS feed''' that I can respond to by clicking its links, which gives '''feedback''' to the installation.
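The analysis step could start as simply as tallying which known names occur in the text of the visited pages. A toy sketch with made-up page texts; real name extraction would need more than substring matching:

```python
def count_name_mentions(pages, known_names):
    """Tally how often each known name appears across the text of visited pages."""
    counts = {name: 0 for name in known_names}
    for text in pages:
        for name in known_names:
            if name in text:
                counts[name] += 1
    return counts

visited = [
    "an interview with Slavoj Zizek about film",
    "Slavoj Zizek and Alain Badiou in conversation",
]
print(count_name_mentions(visited, ["Slavoj Zizek", "Alain Badiou", "Justin Bieber"]))
```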
 
====Application three: search engine====
 
I could also put this online as a '''service''' (including some of my planned improvements) and see what happens with it.
 
====Prospects: crowd sourcing====
 
* RSS feedback loop system within a community
* feedback to the Wiki community (critique pages/templates?)
* improve English with Dutch grammar
* hCard?
 
====Critique====
 
A point of possible critique is that Wikipedia is not for the common people, and the same may be true for this algorithm. It might only be useful for people like me, who know a little about a lot and are curious for more.
 
Michael Jackson:
<source lang="python">
[(u'Michael Jackson', 60), (u'Jermaine Jackson', 20), (u'Janet Jackson', 19), (u'Stevie Wonder', 19), (u'Prince (musician)', 18), (u'Madonna (entertainer)', 18), (u'Justin Timberlake', 17), (u'Bob Dylan', 16),
(u'Paul McCartney', 16), (u'Tina Turner', 16), (u'Marlon Jackson', 16), (u'La Toya Jackson', 15), (u'Mariah Carey', 15), (u'Lionel Richie', 15), (u'Britney Spears', 15), (u'Whitney Houston', 15), (u'Diana Ross', 15),
(u'Little Richard', 15), (u'Usher (entertainer)', 14), (u'Christina Aguilera', 14)]
</source>

Latest revision as of 19:54, 22 May 2011
