Roll your own google: Difference between revisions

From XPUB & Lens-Based wiki

Revision as of 10:24, 4 March 2014

This page is currently being worked on.

Google dominates contemporary access to the Internet, having become for many net users virtually synonymous not only with search, but also with online video (through YouTube) and mobile (through the Android platform).

BACK in the daze, net sites were sparse, isolated islands, tethered together by webrings and a patchwork of amateur link lists and proto-portals. Early search sites were split between (human) editorially maintained portal sites and a variety of "indexes" built by web crawlers. The AltaVista search engine was perhaps the first breakthrough in managing the scale of the ever-growing web of documents, and it shot to an early lead among a field of many, often diversely focussed, web search engines.[1] Google, however, perfected the formula by not only taking on the scale but also leveraging the link structure of the net itself, via its PageRank algorithm, to deliver even better search results, while managing (in fact thriving on) the volume of the web.
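To make the idea of "leveraging the link structure" concrete, here is a minimal sketch of the textbook PageRank iteration, not Google's production ranking. The function name, the dict-of-lists input format, and the iteration count are assumptions made for illustration; 0.85 is the commonly cited damping factor.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by links. links: dict page -> list of pages it links to."""
    # Every page that appears as a source or as a target takes part.
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is shared among all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes its rank on, split evenly over its outgoing links.
        for page, targets in links.items():
            for target in targets:
                new[target] += damping * rank[page] / len(targets)
        rank = new
    return rank
```

Calling `pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})` gives "c" the highest score, since two pages link to it, and "b" the lowest, since nothing links to it.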

This assignment is at once a simple exercise in CGI scripting and an opportunity to reflect critically on the state of the Web and the role of centralized commercial services such as Google.

Restriction: all data that your CGI script uses must be local to the server. This means your "results" will be purely algorithmic, and/or based only on the input provided (via the search box), and/or drawn from collected or crawled data that you have gathered yourself. It's only fair: that's how Google works too.

CGI

  • Start with a simple form (HTTP POST, submit!)
  • Can respond with ANY content type (image/audio/...)
  • Respond to the browser (with audio?)
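The points above can be sketched as a single CGI script that prints a search form and echoes back whatever was submitted. This is a minimal sketch under assumptions: the field name `q` and the function names are invented for illustration, and the script parses the request with `urllib.parse` from the standard library instead of a dedicated CGI helper module.

```python
#!/usr/bin/env python3
"""Minimal CGI endpoint: prints a search form and echoes the submitted query."""
import html
import os
import sys
from urllib.parse import parse_qs


def read_query(environ, stdin):
    """Return the 'q' field from a GET query string or a POST form body."""
    if environ.get("REQUEST_METHOD", "GET").upper() == "POST":
        # The server passes the form body on stdin and its size in CONTENT_LENGTH.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        raw = environ.get("QUERY_STRING", "")
    return parse_qs(raw).get("q", [""])[0]


def render(query):
    """Build the full response: header, blank line, then the HTML body."""
    safe = html.escape(query)  # never echo raw user input into HTML
    body = (
        "<!DOCTYPE html>\n"
        '<form method="post">\n'
        f'  <input name="q" value="{safe}">\n'
        '  <input type="submit" value="Search">\n'
        "</form>\n"
    )
    if query:
        body += f"<p>You searched for: {safe}</p>\n"
    return "Content-Type: text/html; charset=utf-8\r\n\r\n" + body


if __name__ == "__main__":
    sys.stdout.write(render(read_query(os.environ, sys.stdout and sys.stdin)))
```

The `Content-Type` header is what tells the browser how to treat the response. To answer with an image or a sound instead of HTML, the script would send a different type (for example `image/png` or `audio/wav`) and write the raw bytes to `sys.stdout.buffer`.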

Links

  • ELIZA, Weizenbaum's clever text-and-response program (chatbot as a kind of search engine)
  • Scrapy
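As a rough illustration of a "purely algorithmic" responder in the spirit of ELIZA, the sketch below matches the input against a few patterns and reflects part of it back. The rules are hypothetical examples written for this page, not Weizenbaum's original script.

```python
import re

# Hypothetical rules: (pattern, response template). Not Weizenbaum's script.
RULES = [
    (re.compile(r"\bi need (.+)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bwhere is (.+)", re.I), "What would you do if you found {0}?"),
]
DEFAULT = "Tell me more."


def respond(text):
    """Return the response of the first rule whose pattern matches the input."""
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Drop trailing punctuation before reflecting the phrase back.
            return template.format(match.group(1).rstrip(" .!?"))
    return DEFAULT
```

For example, `respond("I need a search engine.")` returns "Why do you need a search engine?", and any input that matches no rule falls back to the default reply.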

Creating an index

  • How does an algorithm "see" a text, a sound, a video, a webpage?
  • Beautiful Soup
  • Lucene... intro, tokenizer
  • Crawling
  • Danya Vasiliev's Puppet piece ?!
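One way an algorithm might "see" a webpage is as a bag of lowercase tokens plus a list of outgoing links. The sketch below uses that view to build a small inverted index from locally stored pages. It is a minimal sketch under assumptions: it uses the standard library's `html.parser` as a stand-in for Beautiful Soup, a naive regular-expression tokenizer instead of a Lucene-style analyzer, and class and function names invented for illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # pieces of visible text
        self.links = []    # href values, useful as input for a crawler
        self._skip = 0     # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def tokenize(text):
    """Naive tokenizer: lowercase runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """pages: dict name -> HTML string. Returns dict token -> set of names."""
    index = defaultdict(set)
    for name, markup in pages.items():
        parser = TextExtractor()
        parser.feed(markup)
        parser.close()
        for token in tokenize(" ".join(parser.chunks)):
            index[token].add(name)
    return index


def search(index, query):
    """Return the sorted names of pages that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in tokens)))
```

With two local pages, `search(build_index(pages), "web rings")` returns only the pages containing both words. The `links` collected by `TextExtractor` are what a simple crawler would follow next.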

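Notes

  1. Search Engines Book from 2001: http://books.google.be/books?id=7dV-7uIzp2QC&pg=PA262&lpg=PA262&dq=zip2+search&source=bl&ots=fPMCbUt_mW&sig=mtc3UupYmeXvQiolepkjCuoYxDg&hl=en&sa=X&ei=HnAVU4rwKsLnywOjx4CIDQ&ved=0CFwQ6AEwBg#v=onepage&q=zip2%20search&f=false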