Roll your own google
Google dominates contemporary access to the Internet, having become for many net users virtually synonymous not only with search but also with online video (through YouTube) and mobile (through the Android platform).
As the web first developed in the 1990s, websites were sparse, isolated islands, tethered together by [[webrings]] and a patchwork of amateur link lists and proto-[[portals]]. Early search sites balanced between (human) editorially maintained portal sites and a variety of "indexes" based on web crawlers. The [[wikipedia:Altavista|Altavista]] search engine was perhaps the first breakthrough in tackling the ever-growing scale of the online web of documents, and it shot to an early lead in a field of many, often diversely focused, web search engines.<ref>[http://books.google.be/books?id=7dV-7uIzp2QC&pg=PA262&lpg=PA262&dq=zip2+search&source=bl&ots=fPMCbUt_mW&sig=mtc3UupYmeXvQiolepkjCuoYxDg&hl=en&sa=X&ei=HnAVU4rwKsLnywOjx4CIDQ&ved=0CFwQ6AEwBg#v=onepage&q=zip2%20search&f=false Search Engines Book from 2001] describes the "big 6" search engines of the time, including AltaVista, Yahoo, Excite, and MSN.</ref> Google, however, perfected the formula by not only taking on the scale but also leveraging the link structure of the net itself, via its [[wikipedia:PageRank|PageRank]] algorithm, to deliver even better search results while managing (in fact thriving on) the volume of the web. This algorithm, first patented September 4, 2001, forms the basis of Google's commercial success; the exact mechanisms of its current algorithms for producing search results are closely guarded trade secrets.
This exercise is at once a simple exercise in CGI scripting and an opportunity to reflect critically on the state of the Web and the role of centralized commercial services such as Google.
== Exercise ==
Create a CGI "search engine". It needs to be '''self-contained''', that is, contain its own index based on your own specific crawling of data. Part of the point of doing this is to reflect on the question of data centralization and specificity. What does it mean to create your own index?
Your "results" could be "purely" algorithmic, and/or based only on input provided to it (via the search box), and/or based on data you have collected or crawled yourself. It's only fair: that's how Google works too.
'''STARTER CODE --> DOWNLOAD HERE'''
http://pzwart3.wdka.hro.nl/~mmurtaugh/share/ryog.zip
== CGI ==
* Respond to browser (with audio?)
== How search works (according to Google) ==
{{youtube|BNHR6IQJGZs}}
* [http://www.google.com/insidesearch/howsearchworks/ Contemporary (2014) corporate communication from Google about "How Search Works"]
== The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) ==
<blockquote>
Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
</blockquote>
Source: [http://infolab.stanford.edu/~backrub/google.html The Anatomy of a Large-Scale Hypertextual Web Search Engine], research paper by Google co-founders Sergey Brin and Lawrence (Larry) Page, 1998.
== Search circa 1996 ==
http://archive.org/details/ComputerChronicles-SearchEngines_861
Computer Chronicles on Search (1996)
== Search circa 1896 ==
* [http://www.gutenberg.org/ebooks/search/?query=encyclopedia Encyclopedia]
* [http://digitalpublishingtoolkit.org/2013/07/py-clave-and-books-as-api/ Book as an API]
== Epub ==
Emerging standards for electronic books such as [http://idpf.org/epub epub] are interesting in that they bridge web practices and formats with those of traditional publishing.
== Alphabets ==
* [http://www.gutenberg.org/ebooks/search/?query=alphabet Alphabets on Gutenberg]
** [http://www.gutenberg.org/ebooks/30117 ABC Jules Lemaître]... Education
** [http://www.gutenberg.org/ebooks/16081 Anti-Slavery Alphabet]... Resistance
** [http://www.gutenberg.org/ebooks/22427 Typographic/Printing]
* [http://www.geuzen.org/female_icons/play/letters.php De Geuzen Write with Icons]
* [http://www.ubu.com/film/rosler_semiotics.html Martha Rosler, Semiotics of the Kitchen]
* [http://www.ubu.com/film/fluxfilm28_sharits.html Paul Sharits, Word Movie (Fluxus)]
* http://squarevzw.be/ensuite/femkeinterview.htm
* [http://www.gutenberg.org/ebooks/search/?query=dictionnaire Dictionnaire @ Gutenberg]
** [http://www.gutenberg.org/ebooks/14156 Flaubert's Dictionary of Received Ideas]
== Alternative Search Engines ==
* [http://networkcultures.org/wpmu/query/2009/11/14/matthew-fuller-search-engine-alternatives/ Matthew Fuller's Presentation at Society of the Query #1] [https://vimeo.com/81486332 Video]
In it he uses three alternative search engines:
* [http://en.wikipedia.org/wiki/Viewzi Viewzi] ("defunct" since 2010)
* [http://en.wikipedia.org/wiki/Kartoo Kartoo], also stopped in 2010
and...
=== oamos ===
* [http://www.oamos.com/ Oamos] by [http://de.wikipedia.org/wiki/Marc_Lee Marc Lee], who also participated in the Tracenoizer project...
=== tracenoizer ===
* [http://www.1go1.net/index.php/Main/Tracenoizer Tracenoizer], by the Zurich-based net art collective LAN, is a unique "identity" management solution that allows for tactical scraping to create false "clone" homepages and publish them online. [http://www.anninaruest.com/a/tracenoizer/index.html Screenshots]
=== soundbrowser ===
* While (in part) at PZI, Matthias Hurtl developed [http://index.randomaccessmemory.at/index.php?/projects/soundbrowser/ soundbrowser], which searches freesound.org and presents the results as a bank of looping audio players, allowing for parallel playing and mixing.
=== tumblrjumpr ===
* http://www.absentarrays.info/lvdbc/tumblrjumpr/
== Building a search engine in Python ==
* [[Web Spider in Python]]
* [http://pzwart3.wdka.hro.nl/~lchristensen/openlister.html Lasse's crawler], basis for his tumblrjumpr installation (see books)
* [http://p-dpa.tumblr.com/ Silvio Lorusso's Post-Digital Publishing Archive]
* [[Counting word frequency in a text with Python]]
* [[Whoosh]] is a library for indexing texts written in Python
== Notes ==
<references />
Latest revision as of 11:15, 11 March 2014
Google dominates contemporary access to the Internet, having become for many net users virtually synonymous not only with search but also with online video (through YouTube) and mobile (through the Android platform).
As the web first developed in the 1990s, websites were sparse, isolated islands, tethered together by webrings and a patchwork of amateur link lists and proto-portals. Early search sites balanced between (human) editorially maintained portal sites and a variety of "indexes" based on web crawlers. The Altavista search engine was perhaps the first breakthrough in tackling the ever-growing scale of the online web of documents, and it shot to an early lead in a field of many, often diversely focused, web search engines.[1] Google, however, perfected the formula by not only taking on the scale but also leveraging the link structure of the net itself, via its PageRank algorithm, to deliver even better search results while managing (in fact thriving on) the volume of the web. This algorithm, first patented September 4, 2001, forms the basis of Google's commercial success; the exact mechanisms of its current algorithms for producing search results are closely guarded trade secrets.
This exercise is at once a simple exercise in CGI scripting and an opportunity to reflect critically on the state of the Web and the role of centralized commercial services such as Google.
Exercise
Create a CGI "search engine". It needs to be self-contained, that is, contain its own index based on your own specific crawling of data. Part of the point of doing this is to reflect on the question of data centralization and specificity. What does it mean to create your own index? Your "results" could be "purely" algorithmic, and/or based only on input provided to it (via the search box), and/or based on data you have collected or crawled yourself. It's only fair: that's how Google works too.
STARTER CODE --> DOWNLOAD HERE http://pzwart3.wdka.hro.nl/~mmurtaugh/share/ryog.zip
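What a self-contained index might look like can be sketched in a few lines of Python. This is only an illustration, not the starter code: the `DOCUMENTS` dictionary, the file names in it, and the function names are all made up for the example, standing in for whatever data you gather yourself.

```python
import re
from collections import defaultdict

# Hypothetical hand-gathered documents; in the exercise these would come
# from your own crawl or collection.
DOCUMENTS = {
    "alphabet.txt": "A is for alphabet, B is for book",
    "search.txt": "A search engine builds an index of every book it reads",
}


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


INDEX = build_index(DOCUMENTS)
print(search(INDEX, "search book"))
```

The index here is an "inverted" one (from word to documents), which is a common choice but by no means the only one; what goes into `DOCUMENTS`, and what counts as a match, is exactly where the question of specificity comes in.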
CGI
- Start with a simple form (HTTP/Post/Submit!)
- Can respond with ANY type (image/audio/...)
- Respond to browser (with audio?)
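The points above can be sketched as a minimal CGI script. This is one possible shape under stated assumptions, not the starter code: the form field name `q` and the helper names `read_query` and `render` are invented for the example, and it reads the request by hand from the environment and standard input rather than relying on any particular CGI helper library.

```python
#!/usr/bin/env python3
import html
import os
import sys
from urllib.parse import parse_qs

# A simple form that posts back to this same script.
# The field name "q" is an assumption made for this sketch.
FORM = """<form method="post">
<input type="text" name="q">
<input type="submit" value="Search">
</form>"""


def read_query(environ, stdin):
    """Return the submitted 'q' value from a GET or POST request, or ''."""
    if environ.get("REQUEST_METHOD", "GET").upper() == "POST":
        # Form data is URL-encoded ASCII, so reading characters is fine here.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        raw = environ.get("QUERY_STRING", "")
    return parse_qs(raw).get("q", [""])[0]


def render(query):
    """Build the full CGI response: header, blank line, then the body."""
    body = FORM
    if query:
        body += "<p>You searched for: %s</p>" % html.escape(query)
    return "Content-Type: text/html; charset=utf-8\r\n\r\n" + body


if __name__ == "__main__":
    sys.stdout.write(render(read_query(os.environ, sys.stdin)))
```

Responding with another type is a matter of changing the `Content-Type` header (for example to `image/png` or `audio/mpeg`) and writing the raw bytes to `sys.stdout.buffer` instead of text.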
How search works (according to Google)
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine, research paper by Google co-founders Sergey Brin and Lawrence (Larry) Page, 1998.
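To make the quoted passage concrete, here is a toy sketch of the two ideas it describes: turning a document into "hits" and collecting link information as "anchors". It is a loose illustration only and not how Google's indexer is implemented; the names `hits` and `AnchorParser` are invented for the example, font size is left out, and there are no "barrels".

```python
import re
from html.parser import HTMLParser


def hits(text):
    """Return (word, position, capitalized) tuples for each word in text."""
    return [
        (word.lower(), position, word[0].isupper())
        for position, word in enumerate(re.findall(r"[A-Za-z0-9']+", text))
    ]


class AnchorParser(HTMLParser):
    """Collect (from_url, to_url, link_text) for every <a href> in a page."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.anchors = []
        self._href = None  # href of the link currently being read, if any
        self._text = []    # pieces of text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            link_text = "".join(self._text).strip()
            self.anchors.append((self.source_url, self._href, link_text))
            self._href = None


parser = AnchorParser("http://example.org/")  # placeholder source URL
parser.feed('<p>See <a href="http://archive.org/">the Archive</a> now</p>')
print(hits("Roll your own Google"))
print(parser.anchors)
```

Keeping the link text alongside the "from" and "to" addresses is what lets an index describe a page using the words other pages use to point at it.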
Search circa 1996
http://archive.org/details/ComputerChronicles-SearchEngines_861
Computer Chronicles on Search (1996)
Search circa 1896
Epub
Emerging standards for electronic books such as epub are interesting in that they bridge web practices and formats with those of traditional publishing.
Alphabets
- Alphabets on Gutenberg
- ABC Jules Lemaître... Education
- Anti-Slavery Alphabet... Resistance
- Typographic/Printing
- De Geuzen Write with Icons
- Martha Rosler, Semiotics of the Kitchen
- Paul Sharits, Word Movie (Fluxus)
- http://squarevzw.be/ensuite/femkeinterview.htm
- Dictionnaire @ Gutenberg
Alternative Search Engines
In it he uses three alternative search engines:
and...
oamos
tracenoizer
- Tracenoizer, by the Zurich-based net art collective LAN, is a unique "identity" management solution that allows for tactical scraping to create false "clone" homepages and publish them online. Screenshots
soundbrowser
- While (in part) at PZI, Matthias Hurtl developed soundbrowser, which searches freesound.org and presents the results as a bank of looping audio players, allowing for parallel playing and mixing.
tumblrjumpr
Building a search engine in Python
- Web Spider in Python
- Lasse's crawler, basis for his tumblrjumpr installation (see books)
- Silvio Lorusso's Post-Digital Publishing Archive
- Counting word frequency in a text with Python
- Whoosh is a library for indexing texts written in Python
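As a small taste of the word-frequency idea in the list above, the sketch below counts words with Python's standard library and uses the counts as a crude relevance score. It is an assumption-laden illustration, not the content of the linked pages: the function names and the scoring rule (sum of the counts of the query words) are invented for the example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word occurs in text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def score(query, text):
    """Crude relevance: total occurrences of the query words in text."""
    freqs = word_frequencies(text)
    return sum(freqs[word] for word in re.findall(r"[a-z']+", query.lower()))


sample = "The cat and the hat"
print(word_frequencies(sample).most_common(1))
print(score("the hat", sample))
```

Sorting documents by such a score is one of the simplest ways to rank results; it also shows how easily a ranking can be skewed by common words like "the".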
Notes
- ↑ Search Engines Book from 2001 describes the "big 6" search engines of the time, including AltaVista, Yahoo, Excite, and MSN.