Sniff, Scrape, Crawl (Prototyping): Difference between revisions

From XPUB & Lens-Based wiki
 
In 2011, [[Sniff, Scrape, Crawl]] was a thematic project led by Aymeric Mansoux, Renee Turner, and Michael Murtaugh.


This prototyping module revisits some of the themes of that thematic project, focusing in particular on the tools and practices of "scraping". The goal is to become more familiar with what the technique makes possible and to develop strategic uses of the tools for your own research and projects. The module follows on from the ideas developed in [[Roll your own google]].


== Elements ==
* Spidering
* Crawling
* Indexing
* Summarizing
* Break up the steps of Whoosh's indexing ()

== Meeting 1: May 20 ==
=== Morning: Scraping Tools (11:00) ===
Scraping tools and recipes exist at a variety of scales:
* small: [[Simple scraping with wget]] and [[Simple Web Spider in Python]]; you can get pretty far with just some standard Python code and some loops...
* medium: [http://scrapy.org/ Scrapy], a Python "framework" specifically for scraping, inspired by web frameworks like Django
* large: [http://en.wikipedia.org/wiki/Heritrix Heritrix], a full-fledged tool for institutional-scale scraping, used by traditional libraries among others, and provided by (and used extensively for) the [[Internet Archive]] (aka archive.org).
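
To give a feel for the "small" end of that scale, here is a minimal sketch of a spider written with only the Python standard library. It is an illustration, not the code from the linked recipes: the names <code>LinkExtractor</code>, <code>extract_links</code>, <code>crawl</code> and <code>FAKE_SITE</code> are made up for this example, and the "site" is a hypothetical set of three pages held in memory so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/a.html" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first spider: follow links from start_url, up to max_pages pages.

    `fetch` is any function that takes a URL and returns the page's HTML
    as a string, or None if the page could not be retrieved.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # broken link: skip it and carry on
            continue
        visited.append(url)
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site, held in memory so the sketch runs offline.
FAKE_SITE = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/missing.html">broken link</a>',
}

if __name__ == "__main__":
    for page in crawl("http://example.org/", FAKE_SITE.get):
        print(page)
```

To point the same loop at live pages, <code>fetch</code> could be swapped for a small function built on <code>urllib.request.urlopen</code>. In that case it is worth keeping <code>max_pages</code> low and pausing between requests, so the spider does not hammer the site being scraped.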


We'll use the [http://pzwart3.wdka.hro.nl/hotseat/ hot seat] to get our collective feet wet with some simple scraping tools, focussing on the small to medium scale.

=== Afternoon: Meeting to discuss / develop / brainstorm project ideas ===
Meet as a group to discuss/brainstorm ideas for individual research / projects.

== Tools ==
* [[python]]
* [http://scrapy.org/ Scrapy]

== Examples ==
* Tumblr Jumper
* News Tweek

== Meeting 2: May 27 ==
* Workshop (Topic/Tool to be determined based on brainstorm) / Tutorials '''27 May'''
[[Web scraping with Python]]
 
[[Wikipedia Image Scraping]]
 
== Meeting 3: June 30 (Final session) ==
Present your prototype at the (joint) final presentation on Monday '''30 June'''.
 
== Some Examples ==
* [https://archive.org/search.php?query=collection%3A%22focused_crawls%22 focused_crawls] are examples of dumps of various public websites (typically made using the Heritrix tool), which are then placed (in archive format) on the Internet Archive
* [http://www.birgitbachler.com/portfolio/portfolio/bonuskaart-friends/ Birgit Bachler's Bonus Card Friends]


== Links ==
* https://exposingtheinvisible.org/resources/obtaining-evidence/scraping-parsing/
* http://blog.osvdb.org/2014/05/07/the-scraping-problem-and-ethics/
* https://github.com/fallgesetz/Google-Art-Project-Scraper
* [http://scraperwiki.com/ ScraperWiki], [http://onlinejournalismblog.com/2010/07/07/an-introduction-to-data-scraping-with-scraperwiki/ discussed in relation to journalism]; see also [https://classic.scraperwiki.com/browse/scrapers/index.html Classic ScraperWiki]
* http://www.goodiff.org/
* http://knightlab.northwestern.edu/2014/03/20/five-data-scraping-tools-for-would-be-data-journalists/, from which:
** http://www.outwit.com/products/hub/
* http://dbpedia.org/About
* [http://2309digitalsignatures.e-permanent.org/ SVG signatures from Wikipedia]
* [http://www.wikigifs.org/ Every animated GIF from Wikipedia]

Latest revision as of 15:00, 16 June 2014
