Scraping: Difference between revisions

Revision as of 15:12, 11 April 2009

Scraping (also Screen Scraping) is the process of extracting data from output that was formatted for human readers, such as a web page, rather than from a structured data source.

In the course, we have used the library BeautifulSoup to manipulate HTML pages in Python.
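As a rough illustration of what extracting data from an HTML page involves, here is a minimal sketch that collects the target and text of every link. It deliberately uses only the standard library's html.parser instead of BeautifulSoup, so it runs without installing anything. The names LinkExtractor, extract_links and SAMPLE_HTML are made up for this example, and the sample page is invented.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag we are currently inside
        self._text = []      # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Hypothetical helper: parse an HTML string and return its links."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample page, used only for this illustration
SAMPLE_HTML = """
<html><body>
  <p>Some <a href="http://example.com/one">first link</a> and
  <a href="http://example.com/two">second <b>link</b></a>.</p>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
for href, text in links:
    print(href, "->", text)
```

BeautifulSoup hides most of this bookkeeping behind a tree you can search, which is why it is the more comfortable choice for real pages.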

Other interesting libraries to consider:

Mechanize, which in essence simulates a browser in Python and can "remember" things (like cookies and sessions) between pages
lxml, which can apparently deal with malformed HTML and quickly convert it to XML trees
html5lib
python-spidermonkey... JavaScript meets Python
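To make the idea of "remembering" cookies between pages more concrete, here is a minimal sketch of the same mechanism. It is not Mechanize itself: it uses the standard library's urllib.request and http.cookiejar, and the function name make_session_opener is an assumption made up for this example.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Hypothetical helper: build an opener that keeps cookies between requests.

    Returns the opener together with its cookie jar so the stored
    cookies can be inspected afterwards.
    """
    jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor stores cookies from each response in the jar
    # and sends matching ones back on later requests made with this opener.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_session_opener()
print("cookies stored so far:", len(jar))
```

Calling opener.open(url) for a login page and then again for a later page would send the session cookie along with the second request, which is roughly the bookkeeping a browser-like library does for you. No request is made in the sketch so that it runs without network access.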

Resources:

@@ Line 7: / Line 7: @@
 * [http://codespeak.net/lxml/ lxml] which can apparently deal with malformed HTML and quickly convert it to XML trees
 * [http://code.google.com/p/html5lib/ html5lib]
-* http://github.com/davisp/python-spidermonkey/tree/master Python SpiderMonkey... hmmm not sure exactly what this does yet!
+* [http://code.google.com/p/python-spidermonkey/ python-spidermonkey]... JavaScript meets Python
 Resources:
 * http://us.pycon.org/2009/tutorials/schedule/2AM8/