Sniff, Scrape, Crawl (Prototyping)
In 2011, Sniff, Scrape, Crawl was a thematic project led by Aymeric Mansoux, Renee Turner, and Michael Murtaugh.
This prototyping module revisits some of that project's themes, focusing in particular on the tools and practices of scraping.
Elements
- Spidering
- Crawling
- Indexing
- Summarizing
- Break up the steps of Whoosh's indexing
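As a starting point for breaking indexing into steps, here is a minimal sketch in plain Python. It does not use Whoosh's API; it is a stand-in that only assumes the usual stages of a text indexer (tokenize, filter stop words, record which document each term occurs in). The names `tokenize`, `remove_stop_words`, `build_index`, `search`, and the tiny `STOP_WORDS` set are all hypothetical and chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Step 1: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Step 2: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Step 3: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in remove_stop_words(tokenize(text)):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = remove_stop_words(tokenize(query))
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Two made-up documents standing in for scraped pages.
docs = {
    "page1": "The spider follows links in a page.",
    "page2": "An index maps terms to pages.",
}
index = build_index(docs)
print(sorted(search(index, "links")))
```

Keeping each step as its own function makes it easy to swap one out, for example replacing the tokenizer, and see how the resulting index changes.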
Tools
- S: Simple Web Spider in Python
- M: Scrapy
- L: Heritrix
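To give a sense of what the smallest of these tools does, here is a rough sketch of a breadth-first spider using only the Python standard library. It is not the code of any tool listed above; `LinkParser`, `extract_links`, `fetch`, `crawl`, and the `max_pages` limit are hypothetical names introduced for this example. The `fetch` function is passed in as a parameter so the crawl logic can be tried on made-up pages without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links in html, without #fragments."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]


def fetch(url):
    """Download a page and return its text (default fetcher, uses the network)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch, max_pages=10):
    """Visit pages breadth-first from start_url; return URLs in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("http://example.test/", fetch=pages.__getitem__)` with a dictionary `pages` that maps URLs to HTML strings runs the spider offline. A real crawl would also need to respect robots.txt and pause between requests, which this sketch leaves out.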
Examples
- focused_crawls
- Lasse's Tumblr Jumper
- Birgit Bachler's Bonus Card Friends