User:Fako Berkers/project2

Sniff, Scrape, Crawl

WikiAPI

I have had a look at Wikipedia, and I'm especially interested in categories that list people. There is, for instance, a category of Marxist theorists (to stay a little bit in the same genre as last trimester). Its page lists all people categorized as Marxist theorists and nothing else.

I find categories exciting whenever I regard them as communities. The people listed there may not even be aware of this community, but in fact some common ideal, subject or whatever else binds them together.

I would like to sniff, scrape and crawl in a number of ways to reveal these communities to themselves and others. The following possibilities occurred to me when viewing the Wiki API:

try to fetch jargon used by a community (or their wiki users/pages)
try different kinds of mapping (most quoted, highest rank by Google, most backlinks, voted most important by own community, voted most important by critics)
fetch total bibliography of community and make up sorting algorithms
create a "fieldview" by relating the communities of critics to the community being portrayed
try a community kickstart by putting email addresses associated with the names on a mailing list

In the long run, small apps like these might build up to article validation. For instance, if a text called text.A contains jargon from community.13, a computer could see to whom the described ideas belong and how they are regarded by other communities (critiques) and the rest of the world (popularity measured through Google ranking).

Article validation may be useful to counter information overload, but I do think that users should always be able to favor certain writers manually. This is to make sure that people choose to ignore or favor certain writing instead of a computer telling people what to read because most people read that.

Wiki Interface

There are two classes: ApiInterface and ScrapeInterface. Create an object of either class and call object.parse(pagetitle) to get the data. The ScrapeInterface takes all CDATA from tags of your choice; pass the tag names (p, h1) as a list to the constructor. Its parse(pagetitle) method returns a list with all fetched CDATA. For the ApiInterface you have to choose a "generator" when creating the object. Its parse(pagetitle) method returns a list of page dictionaries that depends on the generator. An object created with "links", for example, returns info about all pages that are linked from the requested page.

POSSIBLE GENERATORS

links (pl)
images (im)
templates (tl)
categories (cl)
backlinks (bl)
categorymembers (cm) (call with Category:SomeCategory only)

IMPLEMENTABLE GENERATORS (from instead of titles+continue)

allimages (ai)
allpages (ap)
allcategories (ac)
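As a rough illustration of how the generator names and their two-letter prefixes end up in a request, the sketch below builds a query URL along the same lines as ApiInterface. It is written for Python 3 (the module on this page targets Python 2), and `GENERATOR_PREFIXES` and `build_query_url` are names made up for this example, not part of the module.

```python
from urllib.parse import quote

# Generator names from the list above, mapped to their parameter prefixes.
GENERATOR_PREFIXES = {
    "links": "pl",
    "images": "im",
    "templates": "tl",
    "categories": "cl",
    "backlinks": "bl",
    "categorymembers": "cm",
}

API_URL = "http://en.wikipedia.org/w/api.php"


def build_query_url(generator, title, limit=None):
    """Build a query URL for one generator and one page title (hypothetical helper)."""
    prefix = GENERATOR_PREFIXES[generator]
    # backlinks and categorymembers take the title in their own parameter;
    # the other generators read it from the shared "titles" parameter.
    if generator in ("backlinks", "categorymembers"):
        title_key = "g" + prefix + "title"
    else:
        title_key = "titles"
    url = (API_URL + "?action=query&format=json&redirects=true&prop=info"
           + "&generator=" + generator
           + "&" + title_key + "=" + quote(title))
    if limit is not None:
        # Generator parameters carry a leading "g", e.g. gcmlimit.
        url += "&g" + prefix + "limit=" + str(limit)
    return url


print(build_query_url("categorymembers", "Category:SomeCategory", limit=500))
```

The title is percent-encoded before it is appended, so spaces and colons in page titles do not break the URL.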

#!/usr/bin/env python


# Import necessary modules
import urllib2
import json
import HTMLParser #SAX like HTML parser. NOT PYTHON 3.0 COMPATIBLE: PROBLEMS WITH WORD SCRAPING!!
import time


# Queries the Wikipedia API and collects the resulting pages (which depend on the constructor arguments) in self.collection
class ApiInterface:

    def __init__(self, generator, urlpostfix =""):
        self.url = u'http://en.wikipedia.org/w/api.php?action=query&format=json&redirects=true&prop=info&generator='+generator.encode('utf-8')
        self.urlpostfix = urlpostfix
        self.maxlimit = False
        self.collection = []
        if generator == "backlinks":
            self.pagekey = u"&gbltitle="
        elif generator == "categorymembers":
            self.pagekey = u"&gcmtitle="
        else:
            self.pagekey = u"&titles="
    
    def parse(self,  qpage):
        workurl = self.url+self.pagekey+urlescape(qpage)+self.urlpostfix.encode("utf-8")
        self.collection = []
        
        # DJANGO CACHE CODE HERE
        # Check if workurl is in database.
        # If so ... return data from cache
        
        while workurl:
        
            print "Next up: " + workurl
        
            # Formulate request
            request = urllib2.Request(workurl)
            user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
            request.add_header("User-Agent", user_agent)
            # Make request (try 10 times)
            for i in range(10):
                try:
                    response = urllib2.urlopen(request)
                    break
                except urllib2.URLError:
                    print "Request failed, retrying in 3 seconds"
                    time.sleep(3)
                    continue
            else:
                raise urllib2.URLError("URLError timeout, are you connected??")
            # Do the parsing
            data = json.load(response)
            
            # Harvest data
            if data.has_key("query"):
                for page in data["query"]["pages"].itervalues():
                        self.collection.append(page)
            
            # DJANGO CACHE CODE HERE
            # Save collection in database under workurl.
            
            # Continue with parsing if necessary
            workurl = False
            if data.has_key("query-continue"):
                for content in data["query-continue"].itervalues():
                    for key,  value in content.iteritems():
                        if not self.maxlimit: self.setMaxLimit(key)
                        workurl = self.url+self.pagekey+urlescape(qpage)+u'&'+key.encode("utf-8")+u'='+urlescape(value.encode("utf-8"))+self.urlpostfix.encode("utf-8")
                        
        return self.collection
    
    # SHOULD BE PRIVATE
    def setMaxLimit(self,  key):
        # key is a continue parameter such as "gcmcontinue"; stripping "continue" leaves the prefix
        pre = key[:-len('continue')]
        self.url = self.url + u'&' + pre.encode("utf-8") + u'limit=500'
        self.maxlimit = True


# Made to scrape a Wikipage and save the text content of the chosen tags in self.collection
class ScrapeInterface(HTMLParser.HTMLParser):
    
    def __init__(self, tags = None):
        HTMLParser.HTMLParser.__init__(self)
        self.url = u"http://en.wikipedia.org/wiki/"
        # Avoid a mutable default argument; fall back to the standard tag list
        if tags is None:
            tags = [u'h1',u'td', u'span', u'a', u'p', u'li', u'h2', u'h3', u'h4', u'h5', u'h6']
        self.tags = tags
        self.collection = []
        self.record = False
        
    
    def parse(self,  qpage,  cache=False):
        # Prepare scrape
        workurl = self.url+urlescape(qpage)
        self.collection = []
        print "URL scrape: " + workurl
        
        # DJANGO CACHE CODE HERE
        # Check if workurl is in database and has matching self.tags if cache is set to true
        # If so ... return data from cache
        
        # Formulate request
        request = urllib2.Request(workurl)
        user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
        request.add_header("User-Agent", user_agent)
        # Make request (try 10 times)
        for i in range(10):
            try:
                response = urllib2.urlopen(request)
                break
            except urllib2.URLError:
                time.sleep(3)
                continue
        else:
            raise urllib2.URLError("URLError timeout, are you connected??")
        # Do the parsing.
        htmlstring = response.read()
        self.feed(htmlstring) # this will fill self.collection with page content per tag.
        
        # DJANGO CACHE CODE HERE
        # If cache is set to true then save results in database under url and tags.
        
        return self.collection
    
    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.record = True
        else:
            self.record = False
    
    def handle_data(self,  data):
        if self.record: 
            self.collection.append(data)

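def urlescape(text):
    # Percent-encode a page title for use in a URL.
    # Unicode input is encoded to UTF-8 first, because urllib2.quote
    # raises KeyError on non-ASCII unicode strings.
    if isinstance(text, unicode):
        text = text.encode("utf-8")
    return urllib2.quote(text)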
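
The import comment in the module above notes that HTMLParser has problems under Python 3. As a minimal sketch of the same tag-filter idea on Python 3's `html.parser`, the example below feeds a fixed HTML string instead of a live page, so it runs offline. `TagTextCollector` is a made-up name, and this is an illustration rather than a port of ScrapeInterface.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect text that follows a start tag from a chosen set (hypothetical name)."""

    def __init__(self, tags=("p", "h1")):
        super().__init__()
        self.tags = set(tags)
        self.collection = []
        self.record = False

    def handle_starttag(self, tag, attrs):
        # Same rule as ScrapeInterface: the most recent start tag decides
        # whether the text that follows is recorded.
        self.record = tag in self.tags

    def handle_data(self, data):
        if self.record:
            self.collection.append(data)


collector = TagTextCollector(tags=["p"])
collector.feed("<h1>Title</h1><p>First paragraph.</p><div>Skipped</div>")
print(collector.collection)
```

Because recording is switched only on start tags, text that follows a closing tag keeps being recorded until the next start tag arrives. That matches the behaviour of the class above and may or may not be what is wanted.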

First results

I've played around with the code and got some interesting results. By using a simple algorithm on the Wiki data I'm able to relate people. If you give a name to the program, it will calculate who is most likely some kind of colleague. Indeed, if you're interested in person A, the computer can guess that you also like G and I (for example). Here's one printout:

Slavoj Zizek:
[(u'Slavoj \u017di\u017eek', 41), (u'Jacques Lacan', 9), (u'Antonio Negri', 8), (u'Kojin Karatani', 8), (u'Judith Butler', 7), (u'Rosa Luxemburg', 7), (u'Jacques Derrida', 7), (u'Chopper Read', 7), 
(u'Bo\u017eidar Debenjak', 7), (u'Victor Menezes', 6), (u'Julia Kristeva', 6), (u'Alexander Toradze', 6), (u'Ale\u0161 Debeljak', 6), (u'Jean-Pierre Jeunet', 6), (u'Luce Irigaray', 6), (u'Boeing 727', 6), (u'Stephen Bronner', 6), 
(u'Rastko Mo\u010dnik', 6), (u'Steve Brookstein', 6), (u'Alain Badiou', 6)]

It's interesting that the algorithm can easily predict for itself whether results will be reasonable or bad. The algorithm could use some fine-tuning to get rid of nonsense like Boeing 727 :) I do have ideas on how to do that, but the calculations already take 10 to 20 minutes, so imagine an improved version ... I'm optimizing before expanding for sure. Django could be my best friend in this.
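
The scoring behind printouts like the one above is not shown on this page. One plausible reading, and it is only an assumption, is that every category of the requested person gives one point to each page in that category, which would also explain why the person's own name comes out on top. The sketch below (Python 3) tries that idea on a toy dictionary; `CATEGORIES` and `related` are hypothetical names and the data is invented.

```python
from collections import Counter

# Hypothetical toy data: page title -> set of categories it belongs to.
CATEGORIES = {
    "Person A": {"Marxist theorists", "Living people", "Philosophers"},
    "Person G": {"Marxist theorists", "Philosophers"},
    "Person I": {"Marxist theorists", "Living people"},
    "Person X": {"Living people"},  # shares only a very broad category
}


def related(name, categories=CATEGORIES, top=20):
    """Rank pages by how many categories they share with `name`."""
    # Invert the mapping: category -> pages in that category.
    members = {}
    for page, cats in categories.items():
        for cat in cats:
            members.setdefault(cat, set()).add(page)
    # Every category of `name` gives one point to each of its members.
    scores = Counter()
    for cat in categories[name]:
        scores.update(members[cat])
    return scores.most_common(top)


print(related("Person A"))
```

With real data the category-to-members step would come from the categorymembers generator, which is where broad categories turn into a very large number of requests.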

Without my being aware of it, the results led to a new kind of search engine. I like the emotions that I get while viewing the results. It seems like my attention is brought to interesting new people by using it.

Stage two plans

The most important thing now is to optimize. I assume the URL requests are taking the most time. The program will often make more than a thousand requests, because Category:Living_people is often fully investigated. This means it has to go through half a million names. If I saved the category listing with Django in a sort of cache, I could produce the same results without overloading the connection.
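
To make the caching idea concrete, here is a minimal sketch that stores each listing under its request URL. It uses the standard library's `sqlite3` instead of Django, purely so the example is self-contained; `ListingCache` and `fake_fetch` are made-up names, and the fetch function stands in for the real request so nothing goes over the network.

```python
import json
import sqlite3


class ListingCache:
    """Cache fetched listings in SQLite, keyed by request URL (hypothetical helper)."""

    def __init__(self, fetch, path=":memory:"):
        self.fetch = fetch  # callable: url -> JSON-serialisable data
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, data TEXT)"
        )

    def get(self, url):
        row = self.db.execute(
            "SELECT data FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no request needed
        data = self.fetch(url)  # cache miss: do the slow request once
        self.db.execute(
            "INSERT INTO cache (url, data) VALUES (?, ?)", (url, json.dumps(data))
        )
        self.db.commit()
        return data


calls = []


def fake_fetch(url):
    # Stand-in for the real request so the example runs offline.
    calls.append(url)
    return [{"title": "Example page"}]


cache = ListingCache(fake_fetch)
first = cache.get("http://example.org/listing")
second = cache.get("http://example.org/listing")
print(len(calls))  # the second call is served from the cache
```

Passing a file path instead of ":memory:" would keep the listings between runs, which is the part that should save the most time.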

Improved ranking

There are a few interesting things I can do with the code, once I have optimized it with Django, to improve the ordering of the results.

I can set up a “control group” for each search and use that data to make commonly used categories less important than rare ones. This tool can easily be adapted to filter a vocabulary from categories (for example the rare categories), which further expands the possibilities.
I could distinguish between related and unrelated categories, which may improve the ordering of results (pushing Boeing 727 and irrelevant people to the back), especially when dealing with less documented people.
An alternative way to improve results is to relate categories and names to the categories and names used on the page itself. This may remove results like Boeing 727 and some irrelevant people (like Michael Jackson in results for Albert Einstein).
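
The first option can be sketched with a weighting borrowed from the inverse document frequency idea in text search: a category that shows up on every page of the control group weighs nothing, a rare one weighs a lot. This is one possible way to do it under that assumption, not a worked-out design; `category_weights` and `weighted_score` are hypothetical names and the control group is invented (Python 3).

```python
import math
from collections import Counter


def category_weights(control_group):
    """Weight each category by how rare it is in the control group.

    control_group: list of category sets, one per sampled page.
    """
    total = len(control_group)
    counts = Counter(cat for cats in control_group for cat in cats)
    # Inverse-document-frequency style weight: rare categories score higher.
    return {cat: math.log(total / count) for cat, count in counts.items()}


def weighted_score(cats_a, cats_b, weights):
    """Sum the weights of the categories two pages share."""
    return sum(weights.get(cat, 0.0) for cat in cats_a & cats_b)


control = [
    {"Living people", "Marxist theorists"},
    {"Living people", "Singers"},
    {"Living people", "Painters"},
    {"Living people", "Marxist theorists", "Philosophers"},
]
weights = category_weights(control)
score = weighted_score(control[0], control[3], weights)
print(weights["Living people"], score)
```

In this toy control group "Living people" appears on every page and therefore contributes nothing, so a match on that category alone no longer lifts a page into the results.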

All options will complement each other, and the first potentially draws the idea in another direction (searching on words instead of names). If the Django cache code is flexible enough, it might speed up these improvements as well.

However, I doubt whether I want to reorder the results, because I kinda like the dirtiness (it pleasantly surprises me). I would then, however, like to know why something like Boeing 727 was associated with Zizek. This could be done with an improved printing procedure.
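
Assuming that associations come from shared categories (the page does not say so explicitly), an improved printing procedure could simply list what two pages have in common. A small Python 3 sketch with invented data, where `explain` is a hypothetical name:

```python
def explain(name, other, categories):
    """Return the categories two pages share, sorted for stable printing."""
    return sorted(categories[name] & categories[other])


# Hypothetical toy data: page title -> set of categories.
categories = {
    "Person A": {"Living people", "Marxist theorists", "1949 births"},
    "Page B": {"1949 births", "Living people"},
}

for cat in explain("Person A", "Page B", categories):
    print("shared:", cat)
```

Printing the shared categories next to each score would keep the surprising results in place while making them traceable.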

Application one: community kickstarter

I thought about trying to bring a community to life by harvesting email addresses or other contact details and putting the people in a category into contact with each other. This could “kickstart” a community. Such communities could for instance work on a liquid publication. Marxists might be interested in working together on questions such as:

"Can the traditional, liberal notion of ownership be retained at all in this framework [of the liquid book], and -- if we were to forgo it -- what would it mean for our established ideas of social exchange, economy, property, profit-making, and capitalism itself? Naturally, this raises a number of serious political and ethical questions that -- utopian as it may sound -- have the potential to reshape the very socio-political order in which we operate."

I'm still unsure whether to go through with this. It is easily done, but how should I respond if a few people start responding? How to direct without interfering too much?

Application two: personal persons

I also thought about writing a browser add-on that links all the traffic going through the browser to a server port. This port will then record all visited URLs and analyse them for the names they contain. The result will be a database that contains an image of the persons and kind of content that I'm interested in and (more importantly) an image of people that I do not regularly read but may have an interest in. This installation will draw my attention towards possible role models, or at least towards other people instead of mobile phone discounts. The installation might produce an RSS feed that I can respond to by clicking its links, which will give feedback to the installation.
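
The analysis step on the server side could start as simple as counting in how many visited pages each known name appears. The sketch below (Python 3) assumes the page texts have already been recorded and that a list of names is available, for instance from category listings; `count_names` is a hypothetical name and the page texts are invented.

```python
from collections import Counter


def count_names(page_texts, known_names):
    """Count in how many visited pages each known name appears."""
    tally = Counter()
    for text in page_texts:
        lowered = text.lower()
        for name in known_names:
            # Plain substring match; real pages would need smarter matching.
            if name.lower() in lowered:
                tally[name] += 1
    return tally


visited = [
    "A page that mentions Judith Butler.",
    "Another page about Judith Butler and Alain Badiou.",
    "Mobile phone discounts this week only!",
]
tally = count_names(visited, ["Judith Butler", "Alain Badiou", "Jacques Lacan"])
print(tally.most_common())
```

Names with a low count that are closely related to names with a high count would be the candidates for the "people I do not regularly read" part of the database.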

Application three: search engine

I could also try to put this as a service online (including some of my planned improvements) and see what will happen with it.

Prospects: crowd sourcing

RSS feedback loop system within community
Feedback to Wiki community (critique pages/templates?)
Improve English with Dutch grammar
hCard??

Critique

A point of possible critique is that Wikipedia is not for the common people and the same may be true for this algorithm. It might only be useful for people like me, who only know a little of a lot and are curious for more.

Michael Jackson

[(u'Michael Jackson', 60), (u'Jermaine Jackson', 20), (u'Janet Jackson', 19), (u'Stevie Wonder', 19), (u'Prince (musician)', 18), (u'Madonna (entertainer)', 18), (u'Justin Timberlake', 17), (u'Bob Dylan', 16), 
(u'Paul McCartney', 16), (u'Tina Turner', 16), (u'Marlon Jackson', 16), (u'La Toya Jackson', 15), (u'Mariah Carey', 15), (u'Lionel Richie', 15), (u'Britney Spears', 15), (u'Whitney Houston', 15), (u'Diana Ross', 15), 
(u'Little Richard', 15), (u'Usher (entertainer)', 14), (u'Christina Aguilera', 14)]