User:Fako Berkers/project2: Difference between revisions

Revision as of 20:49, 24 January 2011

Sniff, Scrape, Crawl

WikiAPI

I have had a look at Wikipedia and I'm interested in categories, especially those that contain people. There is, for instance, a category of Marxist theorists (to stay a little bit in the same genre as last trimester). Its page lists everyone categorized as a Marxist theorist and nothing else.

I find categories exciting whenever I regard them as communities. The people listed there may not even be aware of this community, but in fact some common ideal, subject or whatever binds them together.

I would like to sniff, scrape and crawl in a number of ways to reveal these communities to themselves and to others. The following possibilities occurred to me when viewing the Wiki API:

try to fetch jargon used by a community (or their wiki users/pages)
try different kinds of mapping (most quoted, highest rank on Google, most backlinks, voted most important by own community, voted most important by critics)
fetch total bibliography of community and make up sorting algorithms
create a "fieldview" by relating the communities of critics to the community being portrayed
try a community kickstart by putting email addresses associated with the names on a mailing list
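
Each of these ideas starts from the list of pages in a category. As a rough sketch of what such a request could look like, the snippet below builds a query URL for the MediaWiki API's `list=categorymembers` module. The endpoint, the category title `Category:Marxist theorists` and the helper name `category_members_url` are assumptions for illustration; nothing here is taken from the actual program.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint of the English Wikipedia API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=500, continue_token=None):
    """Build the URL that asks for one batch of pages in `category`."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    if continue_token is not None:
        # Token handed back by the previous response when more pages remain.
        params["cmcontinue"] = continue_token
    return API_ENDPOINT + "?" + urlencode(params)


url = category_members_url("Category:Marxist theorists")
print(url)
# Decode the query string again to check what was asked for.
print(parse_qs(urlparse(url).query)["cmtitle"])
```

The request itself is left out on purpose, so the sketch runs without a network connection.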

In the long run small apps like these might build up to article validation. For instance, if a text called text.A contains jargon from community.13, then a computer could see to whom the described ideas belong and how these are regarded by other communities (critiques) and by the rest of the world (popularity measured through Google ranking).

Article validation may be useful to counter information overload, but I do think that users should always be able to favor certain writers manually. This makes sure that people themselves choose which writing to ignore or favor, instead of a computer telling them what to read because most people read it.

First results

I've played around with the API and got some interesting results. By using a simple algorithm on the Wiki data I'm able to relate people. If you give the program a name, it calculates who is most likely some kind of colleague; if you're interested in person A, the computer can guess that you will also like G and I (for example). Here's one printout:

Slavoj Zizek:
[(u'Slavoj \u017di\u017eek', 41), (u'Jacques Lacan', 9), (u'Antonio Negri', 8), (u'Kojin Karatani', 8), (u'Judith Butler', 7), (u'Rosa Luxemburg', 7), (u'Jacques Derrida', 7), (u'Chopper Read', 7), 
(u'Bo\u017eidar Debenjak', 7), (u'Victor Menezes', 6), (u'Julia Kristeva', 6), (u'Alexander Toradze', 6), (u'Ale\u0161 Debeljak', 6), (u'Jean-Pierre Jeunet', 6), (u'Luce Irigaray', 6), (u'Boeing 727', 6), (u'Stephen Bronner', 6), 
(u'Rastko Mo\u010dnik', 6), (u'Steve Brookstein', 6), (u'Alain Badiou', 6)]
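
The shape of this printout (name and score pairs, with the queried person on top) would be consistent with a simple count of shared categories. The page does not show the actual algorithm, so the following is only a minimal sketch under that assumption. The function `related_people`, the two lookup dictionaries and the sample data are all made up for illustration.

```python
from collections import Counter


def related_people(name, categories_of, members_of, top=20):
    """Rank pages by how many categories they share with `name`.

    categories_of: dict mapping a page title to its list of categories
    members_of:    dict mapping a category to the page titles in it
    Both are hypothetical stand-ins for data fetched from the wiki API.
    """
    scores = Counter()
    for category in categories_of[name]:
        for member in members_of[category]:
            scores[member] += 1  # one point per shared category
    return scores.most_common(top)


# Made-up data, only to show the shape of the result.
categories_of = {"Person A": ["Category P", "Category Q", "Category R"]}
members_of = {
    "Category P": ["Person A", "Person B", "Person C"],
    "Category Q": ["Person A", "Person B", "Airliner X"],
    "Category R": ["Person A", "Person B"],
}
result = related_people("Person A", categories_of, members_of)
print(result)
```

Under this assumption the queried page always gets the top score, because it shares every one of its categories with itself, and an unrelated page such as an airliner can slip in through a single broad category.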

It's interesting that the algorithm itself can easily predict whether the results will be reasonable or bad. The algorithm could use some fine-tuning to get rid of nonsense like Boeing 727 :) I do have ideas on how to do that, but the calculations already take 10 to 20 minutes, so imagine an improved version ... I'm optimizing before expanding for sure. Django could be my best friend in this.

Without my being aware of it, the results have led to some kind of new search engine. I like the emotions that I get while viewing the results. Using it seems to bring my attention to interesting new people.

Stage two plans

The most important thing now is to optimize. I assume the URL requests take the most time. The program will often make more than a thousand requests, because Category:Living_people is often fully investigated. This means it has to go through half a million names. If I saved the category listings with Django in a sort of cache, I could produce the same results without overloading the connection.
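
To make the caching idea concrete, here is a minimal sketch that uses a plain JSON file instead of Django, so it is a stand-in for the planned cache and not the planned implementation. The class name `CategoryCache` and the `fetch` callback are hypothetical. In the demo the callback is a fake that only records how often it is called, so no network is needed.

```python
import json
import os
import tempfile


class CategoryCache:
    """Keep category listings on disk so each one is downloaded only once."""

    def __init__(self, path, fetch):
        self.path = path
        self.fetch = fetch  # hypothetical callback: category -> list of titles
        self.data = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                self.data = json.load(handle)

    def members(self, category):
        if category not in self.data:
            # Cache miss: this is the slow step that would hit the network.
            self.data[category] = self.fetch(category)
            with open(self.path, "w", encoding="utf-8") as handle:
                json.dump(self.data, handle)
        return self.data[category]


calls = []


def fake_fetch(category):
    """Stand-in for the real download; records every call."""
    calls.append(category)
    return ["Person A", "Person B"]


cache_path = os.path.join(tempfile.mkdtemp(), "categories.json")
cache = CategoryCache(cache_path, fake_fetch)
cache.members("Category:Example")  # fetched and written to disk
cache.members("Category:Example")  # answered from memory
reloaded = CategoryCache(cache_path, fake_fetch)
reloaded.members("Category:Example")  # answered from the file
print(len(calls))  # prints 1
```

Rewriting the whole file on every miss is wasteful for a listing as large as Category:Living_people; a database table, as Django would provide, is the more natural fit there.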

Improved ranking

There are a few interesting things I can do with the code, once I have optimized it with Django, to improve the ordering of the results.

I can set up a “control group” for each search and use that data to make commonly used categories less important than rare categories. This tool can easily be transformed to filter a vocabulary from categories (for example the rare categories), which further expands the possibilities.
I could distinguish between related and unrelated categories, which may improve the ordering of results (pushing Boeing 727 and irrelevant people to the back), especially when dealing with less documented people.
An alternative way to improve results is relating categories and names to the categories and names used on the page itself. This may remove results like Boeing 727 and some irrelevant people (like Michael Jackson in results for Albert Einstein).
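
The first idea, the control group, can be sketched as a weighting step. The formula below (a logarithm of inverse frequency) is one assumed way to make rare categories count more; the text does not fix a formula. The names `make_category_weight` and `weighted_related` and all sample data are hypothetical.

```python
import math
from collections import Counter


def make_category_weight(control_group, categories_of):
    """Return a function scoring a category by its rarity in the control group."""
    counts = Counter()
    for page in control_group:
        counts.update(set(categories_of[page]))
    total = len(control_group)

    def weight(category):
        # A category on every control page gets log(1) = 0;
        # one that never appears there gets the maximum, log(total + 1).
        return math.log((total + 1) / (counts[category] + 1))

    return weight


def weighted_related(name, categories_of, members_of, weight, top=20):
    """Like a shared-category count, but each category adds its weight."""
    scores = Counter()
    for category in categories_of[name]:
        for member in members_of[category]:
            scores[member] += weight(category)
    return scores.most_common(top)


# Made-up data: "Common" is on every control page, "Rare" on none.
categories_of = {
    "Person A": ["Common", "Rare"],
    "Control 1": ["Common"],
    "Control 2": ["Common"],
    "Control 3": ["Common"],
}
members_of = {
    "Common": ["Person A", "Person B", "Person C"],
    "Rare": ["Person A", "Person B"],
}
weight = make_category_weight(["Control 1", "Control 2", "Control 3"], categories_of)
ranking = weighted_related("Person A", categories_of, members_of, weight)
print(ranking)
```

In this toy data Person C, who shares only the common category, drops to a score of zero, while Person B keeps the full weight of the rare one.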

All options will complement each other, and the first potentially takes the idea in another direction (searching on words instead of names). If the Django-cache code is flexible enough, it might optimize these improvements as well.

However, I'm not sure I want to reorder the results, because I kinda like the dirtiness (it pleasantly surprises me). I would, though, like to know why something like Boeing 727 was associated with Zizek. This could be done with an improved printing procedure.
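
One way such a printing procedure could look is to list, next to each result, the categories it shares with the queried page. This is only a sketch; `shared_categories`, `explain_results` and the sample data are invented for illustration and say nothing about which categories actually linked Boeing 727 to Zizek.

```python
def shared_categories(name, other, categories_of):
    """Return the categories that `name` and `other` have in common, sorted."""
    return sorted(set(categories_of[name]) & set(categories_of[other]))


def explain_results(name, results, categories_of):
    """Turn (title, score) pairs into lines that also give the reasons."""
    lines = []
    for other, score in results:
        reasons = ", ".join(shared_categories(name, other, categories_of))
        lines.append(f"{other} ({score}): {reasons}")
    return lines


# Made-up data, only to show the output format.
categories_of = {
    "Person A": ["Category P", "Category Q", "Category R"],
    "Person B": ["Category P", "Category Q"],
    "Airliner X": ["Category R", "Category S"],
}
results = [("Person B", 2), ("Airliner X", 1)]
explained = explain_results("Person A", results, categories_of)
for line in explained:
    print(line)
```

A listing like this keeps the surprising results in place and only adds the reason for each of them.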

Application one: community kickstarter

Application two: personal persons

Application three: search engine

Prospects: crowd sourcing

RSS feedback loop system within community
Feedback to Wiki community (critiques?)
Improve English with Dutch grammar
hCard??

Critique

A possible point of critique is that Wikipedia is not for the common people, and the same may be true for this algorithm. It might only be useful for people like me, who know a little about a lot and are curious for more.

Michael Jackson

[(u'Michael Jackson', 60), (u'Jermaine Jackson', 20), (u'Janet Jackson', 19), (u'Stevie Wonder', 19), (u'Prince (musician)', 18), (u'Madonna (entertainer)', 18), (u'Justin Timberlake', 17), (u'Bob Dylan', 16), (u'Paul McCartney', 16), (u'Tina Turner', 16), (u'Marlon Jackson', 16), (u'La Toya Jackson', 15), (u'Mariah Carey', 15), (u'Lionel Richie', 15), (u'Britney Spears', 15), (u'Whitney Houston', 15), (u'Diana Ross', 15), (u'Little Richard', 15), (u'Usher (entertainer)', 14), (u'Christina Aguilera', 14)]
