Roll your own google

From XPUB & Lens-Based wiki

Google dominates contemporary access to the Internet, having become for many net users virtually synonymous not only with search, but online video (through YouTube), and mobile (through the Android platform).

As the web first developed in the 1990s, websites were sparse isolated islands, tethered together by webrings and a patchwork of amateur link lists and proto-portals. Early search sites balanced between (human) editorially maintained portal sites and a variety of "indexes" based on web crawlers. The Altavista search engine was perhaps the first breakthrough in terms of tackling the ever-growing scale of the online web of documents and shot to an early lead among a field of many, often diversely focused web search engines.[1] Google, however, perfected the formula by taking on not only the scale, but leveraging the link structure of the net itself, via its PageRank algorithm to deliver even better search results, while managing (in fact thriving) on the volume of the web. This algorithm, first patented September 4, 2001, forms the basis of Google's commercial success and the exact mechanisms of their current algorithms to produce search results are closely guarded trade secrets.

This exercise is at once a simple exercise in CGI scripting and an opportunity to critically reflect on the state of the Web and the role of centralized commercial services such as Google.

Exercise

Create a cgi "search engine". It needs to be self-contained, that is contain it's own index based on your own specific crawling of data. Part of the point of doing this is to reflect on the question of data centralization and specificity. What does it mean to create your own index? Your "results" could be "purely" algorithmic, and/or based only on input provided to it (via the search box), and/or using either collected or crawled data you've yourself gathered. It's only fair, that's how Google works too.

STARTER CODE --> DOWNLOAD HERE http://pzwart3.wdka.hro.nl/~mmurtaugh/share/ryog.zip

CGI

  • Start with a simple form (HTTP/Post/Submit!)
  • Can respond with ANY type (image/audio/...)
  • Respond to browser (with audio?)

How search works (according to Google)

Search circa 1996

http://archive.org/details/ComputerChronicles-SearchEngines_861

Computer Chronicles on Search (1996)

Search circa 1896

Epub

Emerging standards for electronic books such as epub are interesting in that they bridge web practices and formats with those of tradition publishing.

Alphabets

Alternative Search Engines

Indexing in Python

  • Whoosh is a library for indexing texts written in Python

Notes

  1. Search Engines Book from 2001 describes the "big 6" search engines of the time: including AltaVista, Yahoo, Excite, and MSN .