Web scraping with Python

From XPUB & Lens-Based wiki
Revision as of 17:41, 26 May 2014

Tools

* [[python]]
* [[html5lib]]
* [http://docs.python.org/2/library/xml.etree.elementtree.html ElementTree], part of the standard Python library

html5lib is a Python library for parsing "HTML in the wild". A big advantage of html5lib is that, unlike stricter XML tools, it accepts any input document, even one with missing or incorrect tags. The library follows the tolerant behaviour of most web browsers, which makes it a useful bridge between "pages in the wild" and the precision of XML tools like ElementTree.

While html5lib is not part of the standard Python distribution, it is 100% "pure Python", meaning that it is easy to use across platforms: even without a working tool like pip, you can download the library and import it by simply placing its folder in the same directory as your Python script.
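As a minimal sketch of this tolerant parsing, the snippet below feeds deliberately sloppy markup to html5lib and gets back an ordinary ElementTree element. The sample markup and the helper name parse_tolerant are made up for illustration, and the guarded import is only there because html5lib may not be installed.

```python
try:
    import html5lib  # third-party library, not part of the standard library
except ImportError:
    html5lib = None

# Hypothetical sample input: no <html> or <body>, unclosed <p> and <li> tags.
BROKEN_HTML = "<title>Demo</title><p>First<p>Second<ul><li>one<li>two"


def parse_tolerant(source):
    """Parse messy markup with html5lib and return an ElementTree element."""
    if html5lib is None:
        raise RuntimeError("html5lib is not installed")
    # treebuilder="etree" builds xml.etree.ElementTree elements;
    # namespaceHTMLElements=False keeps tag names plain ("p", not "{...}p").
    return html5lib.parse(source, treebuilder="etree",
                          namespaceHTMLElements=False)


if html5lib is not None:
    root = parse_tolerant(BROKEN_HTML)
    # The parser has supplied the missing structure, so normal
    # ElementTree calls work on the result.
    print([p.text for p in root.iter("p")])
    print([li.text for li in root.iter("li")])
```

Passing namespaceHTMLElements=False is a convenience choice here: with the default setting, every tag name carries the XHTML namespace and queries have to include it.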

ElementTree

While more sophisticated (and faster) libraries for working with HTML/XML exist (namely lxml), Python's standard ElementTree implementation is quite capable. Using lxml also requires a separate C module to be compiled on your platform, which can sometimes complicate installation and distribution of your script.

ElementTree supports a small subset of the (more extensive) XPath query language:

See http://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax
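A few of the supported path expressions can be tried out with nothing but the standard library. The snippet below is a sketch on a small, well-formed document invented for illustration (the tag names, class values and links are assumptions, not taken from a real page).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
DOC = """<html><body>
<div class="post"><h2>First</h2><a href="/one">read</a></div>
<div class="post"><h2>Second</h2><a href="/two">read</a></div>
<div class="footer"><a href="/about">about</a></div>
</body></html>"""

root = ET.fromstring(DOC)

# ".//a" matches every <a> element at any depth below the root.
all_links = [a.get("href") for a in root.findall(".//a")]

# "[@class='post']" keeps only elements whose attribute has that value.
post_titles = [h2.text for h2 in root.findall(".//div[@class='post']/h2")]

# "[2]" selects by position among siblings, counting from 1.
second_link = root.find("./body/div[2]/a").get("href")

# findtext() returns the text of the first matching element.
first_title = root.findtext(".//h2")

print(all_links)
print(post_titles)
print(second_link, first_title)
```

Expressions outside the supported subset (XPath functions such as contains(), for example) are not available in ElementTree, so that kind of filtering has to be done in Python after findall().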

Examples