Scraping: Difference between revisions

Revision as of 07:26, 18 May 2012

Scraping (also screen scraping) is the process of extracting data from output that was made for people to read, such as a web page, rather than from a format designed for programs.

In the course, we have used the BeautifulSoup library to parse and manipulate HTML pages in Python.
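As a rough illustration of the BeautifulSoup approach, here is a minimal sketch that pulls the links out of a page. It assumes the current `bs4` package (the course may have used an older BeautifulSoup release with a different import), and both the HTML snippet and the helper name `extract_links` are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<html><body>
  <h1>Cookbook</h1>
  <ul>
    <li><a href="/wiki/Scraping">Scraping</a></li>
    <li><a href="/wiki/Mechanize">Mechanize</a></li>
  </ul>
</body></html>
"""


def extract_links(markup):
    """Return (text, href) pairs for every <a> tag that has an href."""
    # "html.parser" is Python's built-in parser, so nothing else is needed.
    soup = BeautifulSoup(markup, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


links = extract_links(html)
print(links)
```

In a real scrape the `html` string would come from a fetched page; the extraction step stays the same.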

Other interesting libraries to consider:

* Mechanize, which in essence simulates a browser in Python and can "remember" things (like cookies / sessions) between pages
* lxml, which can deal with malformed HTML and quickly convert it to XML trees
* html5lib, which parses HTML following the HTML5 parsing rules that browsers use
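To make the idea of "remembering" cookies between pages concrete, here is a small sketch using only Python's standard library (`urllib.request` and `http.cookiejar`). This is not Mechanize's own API, just the same idea in miniature; the function name `make_browser_like_opener` and the URLs in the comments are hypothetical.

```python
import http.cookiejar
import urllib.request


def make_browser_like_opener():
    """Return (opener, jar): an opener that keeps cookies between requests."""
    jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor stores cookies from responses in the jar and
    # sends matching ones back on later requests made with this opener.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_browser_like_opener()

# Hypothetical usage (not run here, since it needs a network connection):
# opener.open("http://example.org/login")    # server sets a session cookie
# opener.open("http://example.org/private")  # the cookie is sent back
print(len(jar), "cookies stored so far")
```

Every request made through the same `opener` shares one cookie jar, which is what lets a script stay logged in from one page to the next.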

Resources:

* http://us.pycon.org/2009/tutorials/schedule/2AM8/

See [[Extracting parts of an HTML document]] and other recipes in the [[:Category:Cookbook]].

Edited by Michael Murtaugh (no edit summary).