Scraping: Difference between revisions

From XPUB & Lens-Based wiki

Revision as of 15:30, 25 May 2022

Scraping (also screen scraping) is the process of extracting data from a source that was made for people to look at, such as a web page, rather than for programs to read.

In the course, we have used the library BeautifulSoup to manipulate HTML pages in Python.
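As a reminder of what that looks like, here is a minimal sketch that pulls the links out of a page with BeautifulSoup. The sample HTML and the function name extract_links are made up for illustration, and the page is inlined so the sketch runs without a network connection. It assumes the bs4 package is installed and uses Python's built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, inlined so the sketch runs offline.
HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Links</h1>
    <ul>
      <li><a href="https://example.org/a">First</a></li>
      <li><a href="https://example.org/b">Second</a></li>
      <li><a>No href here</a></li>
    </ul>
  </body>
</html>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> element that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that carry no href attribute at all.
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    for text, href in extract_links(HTML):
        print(text, href)
```

To scrape a live page, the same function could be fed the text of a downloaded document instead of the inline string.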

Other interesting libraries to consider:

  • Mechanize, which in essence simulates a browser in Python and can "remember" things (like cookies / sessions) between pages
  • lxml, which can apparently deal with malformed HTML and quickly convert it to XML trees
  • html5lib, a pure-Python parser that aims to read HTML the same way web browsers do
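To give an idea of what dealing with malformed HTML means in practice, here is a small sketch using lxml. The broken fragment and the function name item_texts are invented for this example, and it assumes the lxml package is installed. The parser is expected to close the dangling tags itself, so the list items can still be walked as a normal tree.

```python
import lxml.html

# Hypothetical malformed fragment: the <li> and <b> tags are never closed.
BROKEN = "<ul><li>one<li>two <b>bold</ul>"


def item_texts(html):
    """Parse possibly broken HTML and return the text of each <li>."""
    root = lxml.html.fromstring(html)
    # iter("li") walks the repaired tree; text_content() flattens
    # nested tags such as <b> into plain text.
    return [li.text_content().strip() for li in root.iter("li")]


if __name__ == "__main__":
    root = lxml.html.fromstring(BROKEN)
    # Serialising the tree shows the markup as lxml repaired it.
    print(lxml.html.tostring(root, encoding="unicode"))
    print(item_texts(BROKEN))
```

html5lib and BeautifulSoup can be used in a similar way; which parser copes best with a given page is something to try out rather than assume.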

Resources:

See Extracting parts of an HTML document and other recipes in the Category:Cookbook

Scraping HTML in Python with html5lib + css selectors (2022)