Sniff, Scrape, Crawl (Prototyping)
Revision as of 14:35, 19 May 2014 by Michael Murtaugh
In 2011, Sniff, Scrape, Crawl was a thematic project led by Aymeric Mansoux, Renee Turner, and Michael Murtaugh.
This prototyping module covers some of the core themes and tools around the practice of "scraping". The goal is to become more familiar with the possibilities of this technique and to develop strategic uses of the tools for your specific research. This module builds on the ideas developed in Roll your own google.
Meeting 1
Morning: Scraping Tools
Scraping tools and recipes exist at a variety of scales:
- small: Simple Web Spider in Python
- medium: Scrapy, a Python "framework" specifically for scraping, inspired by web frameworks like Django
- large: Heritrix, a full-fledged tool used for institutional scraping purposes, for example by traditional libraries and the Internet Archive (archive.org).
We'll use the hot seat technique to get our collective feet wet with some simple scraping tools, focussing on the small to medium scale.
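To give a feel for the "small" end of the scale, here is a minimal sketch of a web spider using only the Python standard library. It is not the recipe linked above, just one possible way to do it: the names `LinkParser`, `fetch`, and `crawl` are made up for this example, and the choices to stay on the starting host and to stop after `max_pages` pages are assumptions to keep the crawl small.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=20):
    """Breadth-first crawl from start_url, staying on the same host.

    Returns the list of visited URLs in the order they were fetched.
    fetch_page is a parameter so the spider can be tried without a network.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("http://example.org/", max_pages=5)` would fetch up to five pages from that host and return their URLs. The sketch does not read robots.txt or pause between requests, both of which you would probably want before pointing it at a real site.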
Afternoon: Meeting to discuss / develop / brainstorm project ideas
Meet as a group to discuss/brainstorm ideas for individual research / projects.
Meeting 2
- Workshop / Tutorials
Meeting 3
"Proof of concept" presentations
Some Examples
- focused_crawls
- Lasse's Tumblr Jumper
- Birgit Bachler's Bonus Card Friends