XPUB & Lens-Based wiki

Scraping

Revision as of 14:30, 25 May 2022 by Michael Murtaugh (talk | contribs)

Scraping (also screen scraping) is the process of extracting data from output meant for people to read, such as a web page, rather than from a source designed for programs.

In the course, we have used the library BeautifulSoup to manipulate HTML pages in Python.
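As a rough illustration of what "extracting data" from an HTML page means, here is a minimal sketch that collects the links in a page. It deliberately uses only the standard library module html.parser rather than BeautifulSoup, so it runs without installing anything; the names LinkCollector, extract_links and SAMPLE are made up for this example and the sample markup is invented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample page, standing in for something fetched from the web
SAMPLE = """
<html><body>
  <h1>Resources</h1>
  <ul>
    <li><a href="http://scrapy.org/">Scrapy</a></li>
    <li><a href="/cookbook">Cookbook <b>recipes</b></a></li>
    <li><a name="top">no href here</a></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE):
        print(href, "->", text)
```

With BeautifulSoup the same task is usually a few lines shorter, since the library builds a tree you can search instead of requiring you to track open tags yourself.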

Other interesting libraries to consider:

Mechanize, which in essence simulates a browser in Python and can "remember" state (such as cookies and sessions) between pages
lxml, which can deal with malformed HTML and quickly convert it to an XML tree
html5lib, which parses HTML following the HTML5 parsing rules that browsers use
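To give an idea of what "remembering" cookies between pages involves, the sketch below builds a cookie-aware opener with the standard library (urllib.request and http.cookiejar). This is not Mechanize's API, only a stand-in for the same idea; make_session is a hypothetical helper name and the URLs in the comments are placeholders.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = CookieJar()
    # HTTPCookieProcessor saves cookies from responses into the jar and
    # adds matching cookies from the jar to later requests.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


if __name__ == "__main__":
    opener, jar = make_session()
    # Placeholder URLs, assumed for illustration only:
    # opener.open("https://example.org/login")    # server may set a session cookie
    # opener.open("https://example.org/private")  # the cookie is sent back automatically
    print(len(jar), "cookies stored so far")
```

Every request made through the same opener shares the jar, which is what lets a scraper stay logged in across pages.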

Resources:

http://us.pycon.org/2009/tutorials/schedule/2AM8/
http://scrapy.org/ Python framework for custom scrapers

See Extracting parts of an HTML document and other recipes in the Category:Cookbook

Scraping HTML in Python with html5lib + css selectors (2022)
