Sniff, Scrape, Crawl (Prototyping)

From XPUB & Lens-Based wiki
Revision as of 15:23, 19 May 2014

In 2011, Sniff, Scrape, Crawl was a thematic project led by Aymeric Mansoux, Renee Turner, and Michael Murtaugh.

This prototyping module revisits some of the themes of that project, focusing in particular on the tools and practices of scraping.

Elements

  • Spidering
  • Crawling
  • Indexing
  • Summarizing
  • Break up the steps of Whoosh's indexing
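
To make the elements above concrete, here is a minimal sketch of spidering (pulling links out of a page), crawling (following them breadth-first) and indexing (mapping words to pages). It is an illustration only, not the module's own code: the PAGES dictionary is a hypothetical three-page site held in memory so the sketch runs offline, and a plain Python dictionary stands in for a real indexer such as Whoosh. Summarizing is left out.

```python
from collections import deque
from html.parser import HTMLParser
import re

# Hypothetical three-page "site" held in memory, so the sketch runs offline.
PAGES = {
    "/index": '<a href="/a">A</a> <a href="/b">B</a> scraping tools',
    "/a": '<a href="/b">B</a> crawling and scraping',
    "/b": '<a href="/index">home</a> indexing pages',
}


class LinkAndTextParser(HTMLParser):
    """Collect href targets (spidering) and visible text from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl(start, fetch):
    """Breadth-first crawl from `start`; `fetch` maps a URL to HTML or None."""
    seen, queue, texts = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unknown page: skip it
        parser = LinkAndTextParser()
        parser.feed(html)
        texts[url] = " ".join(parser.text)
        for link in parser.links:
            if link not in seen:  # visit each page only once
                seen.add(link)
                queue.append(link)
    return texts


def build_index(texts):
    """Inverted index: lowercase word -> sorted list of URLs containing it."""
    index = {}
    for url, text in texts.items():
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index.setdefault(word, []).append(url)
    return {word: sorted(urls) for word, urls in index.items()}


texts = crawl("/index", PAGES.get)
index = build_index(texts)
print(index["scraping"])  # pages whose text mentions "scraping"
```

To try this against real pages, the `fetch` argument could be swapped for a function that downloads a URL, and relative links would then need resolving against the page address. Both are left out here to keep the sketch short.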

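Tools

  • python
  • Scrapy (http://scrapy.org/)

Examples

  • focused_crawls (https://archive.org/search.php?query=collection%3A%22focused_crawls%22)
  • Lasse's Tumblr Jumper
  • Birgit Bachler's Bonus Card Friends (http://www.birgitbachler.com/portfolio/portfolio/bonuskaart-friends/)

Links

  • http://blog.osvdb.org/2014/05/07/the-scraping-problem-and-ethics/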