Sniff, Scrape, Crawl (Prototyping): Difference between revisions

Revision as of 15:31, 19 May 2014

In 2011, Sniff, Scrape, Crawl was a thematic project led by Aymeric Mansoux, Renee Turner, and Michael Murtaugh.

This prototyping module covers some of the core themes and tools around the practice of "scraping". The goal is to become more familiar with the possibilities of this technique and to develop strategic uses of the tools for your own research.

Meeting 1

Scraping Tools

Scraping tools and recipes exist at a variety of scales:

small: Simple Web Spider in Python
medium: Scrapy, a Python "framework" designed specifically for scraping and inspired by web frameworks like Django
large: Heritrix, a full-fledged tool used for institutional scraping purposes, such as those of traditional libraries and the Internet Archive (archive.org).
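
To give a feel for the "small" end of this scale, here is a minimal sketch of a web spider using only the Python standard library. It is not the code from the Simple Web Spider in Python page; the names LinkParser, extract_links, fetch, and crawl, the same-host restriction, and the max_pages limit are all assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) links found in html, without #fragments."""
    parser = LinkParser()
    parser.feed(html)
    absolute = []
    for href in parser.links:
        url, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(url).scheme in ("http", "https"):
            absolute.append(url)
    return absolute


def fetch(url):
    """Download a page and return its text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl from start_url, staying on the same host.

    fetch_page is a hypothetical hook so the spider can be tried out
    on canned pages without touching the network.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        for link in extract_links(html, url):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling crawl("http://example.org/") with the default fetch makes real requests, so it is worth keeping max_pages small while experimenting. Passing a different fetch_page (for instance a function that reads saved HTML files) lets you test the link-following logic offline.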

Afternoon: Meeting to discuss / develop / brainstorm project ideas

Some Examples

Links

http://blog.osvdb.org/2014/05/07/the-scraping-problem-and-ethics/

@@ Line 6: / Line 6: @@
 == Meeting 1 ==
 === Scraping Tools ===
-* S: [[Simple Web Spider in Python]]
+Scraping tools and recipes exist at a variety of scales:
-* M: [http://scrapy.org/ Scrapy]
+* small: [[Simple Web Spider in Python]]
-* L: [http://en.wikipedia.org/wiki/Heritrix Heritrix]
+* medium: [http://scrapy.org/ Scrapy], a Python "framework" designed specifically for scraping and inspired by web frameworks like Django
+* large: [http://en.wikipedia.org/wiki/Heritrix Heritrix], a full-fledged tool used for institutional scraping purposes, such as those of traditional libraries and the Internet Archive (archive.org).
 === Afternoon: Meeting to discuss / develop / brainstorm project ideas ===