Scraping
Revision as of 16:28, 8 October 2012
Scraping (also screen scraping) is the process of extracting data from output that was meant for human readers, such as a web page, rather than from a structured data source.
In the course, we have used the library BeautifulSoup to manipulate HTML pages in Python.
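To make "extracting data" from an HTML page concrete, here is a minimal sketch that collects the link targets from a page. It deliberately uses the standard library's html.parser module as a stand-in for BeautifulSoup, so that it runs without any third-party package; BeautifulSoup offers a more convenient interface for the same kind of task. The names LinkCollector, extract_links and SAMPLE are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page; note the <p> is never closed, which the parser tolerates.
SAMPLE = """<html><body>
<p>See <a href="http://scrapy.org/">Scrapy</a> and
<a href="/wiki/Cookbook">the cookbook</a>.
<a name="top">no href here</a>
</body></html>"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE):
        print(link)
```

The anchor without an href attribute is skipped, so only the two real links are reported.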
Other interesting libraries to consider:
- Mechanize, which in essence simulates a browser in Python and can "remember" state (such as cookies and sessions) between pages
- lxml, which can parse malformed HTML and quickly convert it into XML trees
- html5lib
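What "remembering" cookies between pages means can be sketched with the standard library alone. This is not Mechanize's API; it uses http.cookiejar and urllib.request to show the underlying idea. The tiny local server (CookieDemoHandler), the cookie value and the function fetch_twice are all hypothetical and exist only so the example is self-contained.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieDemoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: sets a cookie once, then echoes it back."""

    def do_GET(self):
        seen = self.headers.get("Cookie", "")
        self.send_response(200)
        if not seen:
            # First visit: hand out a (made-up) session cookie
            self.send_header("Set-Cookie", "session=abc123")
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(("cookie seen: " + seen).encode("utf-8"))

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_twice():
    """Request the same page twice through one cookie-aware opener."""
    server = HTTPServer(("127.0.0.1", 0), CookieDemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port

    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),          # ignore any system proxy
        urllib.request.HTTPCookieProcessor(jar),  # store and resend cookies
    )
    try:
        first = opener.open(url).read().decode("utf-8")
        second = opener.open(url).read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    return first, second


if __name__ == "__main__":
    for body in fetch_twice():
        print(body)
```

On the second request the opener sends back the cookie it received on the first, which is the behaviour a browser-like library automates across a whole browsing session.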
Resources:
- http://us.pycon.org/2009/tutorials/schedule/2AM8/
- http://scrapy.org/ Python framework for custom scrapers
See Extracting parts of an HTML document and other recipes in the Category:Cookbook