Sniff, Scrape, Crawl (Prototyping)
Revision as of 15:25, 19 May 2014
In 2011, Sniff, Scrape, Crawl was a thematic project led by Aymeric Mansoux, Renee Turner, and Michael Murtaugh.
This prototyping module revisits some of that project's themes, focusing in particular on the tools and practices of scraping.
Elements
- Spidering
- Crawling
- Indexing
- Summarizing
- Break up the steps of Whoosh's indexing
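
To make the indexing element above more concrete, here is a minimal sketch of what "breaking up the steps" of an indexer could look like. It uses only the Python standard library and does not use Whoosh's API; the function names (`tokenize`, `normalize`, `remove_stop_words`, `build_index`, `search`) and the stop-word list are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not the list any library ships).
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def tokenize(text):
    """Step 1: split raw text into word tokens."""
    return re.findall(r"\w+", text)


def normalize(tokens):
    """Step 2: lowercase every token so 'Spider' matches 'spider'."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Step 3: drop very common words that carry little meaning."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text):
    """Run the three analysis steps in order."""
    return remove_stop_words(normalize(tokenize(text)))


def build_index(documents):
    """Step 4: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = analyze(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

Because each step is a separate function, any one of them can be swapped out or inspected on its own, for example by printing the output of `analyze` for a scraped page before it is added to the index.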
Tools
- S: Simple Web Spider in Python
- M: Scrapy
- L: Heritrix
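
As a rough idea of what the smallest of these tools involves, the following is a minimal sketch of a breadth-first spider written with the Python standard library only. It is not the code from the Simple Web Spider in Python page; the names `LinkParser`, `extract_links`, `fetch`, and `spider`, and the `max_pages` limit, are assumptions chosen for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (without #fragments) linked from the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]


def fetch(url):
    """Default fetcher: download a page over the network."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def spider(start_url, fetch_page=fetch, max_pages=10):
    """Walk outwards from start_url and return the URLs in visit order."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing a different `fetch_page` function makes it possible to try the spider on a dictionary of saved pages before pointing it at a live site. A real crawl should also respect robots.txt and pause between requests, which this sketch leaves out.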
Examples
- focused_crawls
- Lasse's Tumblr Jumper
- Birgit Bachler's Bonus Card Friends