Scraping: Difference between revisions

From XPUB & Lens-Based wiki

Revision as of 14:10, 11 April 2009

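Scraping (also screen scraping) is the process of extracting data from output that was meant for people to read, such as a rendered web page, rather than from a structured data source.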

In the course, we have used the BeautifulSoup library to parse and manipulate HTML pages in Python.
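
To make the idea concrete, here is a minimal sketch of scraping links out of an HTML page. Note that this is not the BeautifulSoup code used in the course: BeautifulSoup is a third-party package, so the sketch swaps in Python's standard-library html.parser module to stay self-contained. The names LinkCollector, scrape_links and SAMPLE_HTML are made up for this illustration, and the sample page is hard-coded rather than fetched from the network.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are currently inside, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_links(html):
    """Return a list of (href, text) pairs found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page; the <p> is deliberately left unclosed.
SAMPLE_HTML = """
<html><body>
<p>Libraries: <a href="http://codespeak.net/lxml/">lxml</a> and
<a href="http://code.google.com/p/html5lib/">html5lib</a>
</body></html>
"""

links = scrape_links(SAMPLE_HTML)
for href, text in links:
    print(text, href)
```

With BeautifulSoup the same extraction is usually much shorter, which is one reason to prefer a dedicated library for real pages.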

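Other interesting libraries to consider:
* [http://codespeak.net/lxml/ lxml], which can apparently deal with malformed HTML and quickly convert it to XML trees
* [http://code.google.com/p/html5lib/ html5lib]
* [http://github.com/davisp/python-spidermonkey/tree/master Python SpiderMonkey] (not yet sure exactly what this does)

Resources:
* http://us.pycon.org/2009/tutorials/schedule/2AM8/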