ElementTree
ElementTree is an interface to XML documents (such as SVG, or HTML once it has been preprocessed with a parser like html5lib).
An Element behaves like a mix between a list and a dictionary, and gives you relatively easy access for both reading (say, to "scrape" data from a webpage) and writing (say, to alter an SVG or webpage and then re-output it).
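As a rough illustration of that read-then-write cycle, here is a minimal sketch using the standard library module. The sample markup and the variable names (SVG_SOURCE, label, output) are made up for this example, and the document deliberately has no XML namespace to keep the paths short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration (no namespace).
SVG_SOURCE = (
    '<svg width="100" height="50">'
    '<rect id="box" fill="red"/>'
    '<text>Hello</text>'
    '</svg>'
)

root = ET.fromstring(SVG_SOURCE)

# Read: pull a value out of the document.
label = root.find("text").text

# Write: change an attribute and a text node in place.
root.find("rect").set("fill", "blue")
root.find("text").text = "Goodbye"

# Re-output the modified tree as a string.
output = ET.tostring(root, encoding="unicode")

print(label)
print(output)
```

Real SVG files normally declare a namespace, in which case the tag names carry a {namespace} prefix and the paths above would need adjusting.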
The basic rules:
- Everything is an "Element", which corresponds to an HTML tag like p, div, head, body, etc.
- Elements have a ".tag" attribute (not a function), which is the name of the tag (like "p" or "script")
- Elements also have a ".text" attribute, which is the text inside the tag (up to the first child element)
- Elements also have a ".tail" attribute, which is the text immediately following the tag, before the next tag
- Elements, when used in a loop, act like a list of their child elements (the tags inside)
- Elements have a "get" method that retrieves named attributes from the tag (like a dictionary)
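The rules above can be sketched on a small fragment. The markup here is invented for illustration, and the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for illustration.
root = ET.fromstring('<p class="intro">Hello <b>big</b> world<br/>!</p>')

# .tag is a plain attribute holding the tag name.
tag_name = root.tag                      # 'p'

# .text is the text inside the tag, up to the first child element.
leading_text = root.text                 # 'Hello '

# Looping over (or indexing) an Element yields its child elements.
child_tags = [child.tag for child in root]   # ['b', 'br']
bold, line_break = root[0], root[1]

# .tail is the text right after an element's closing tag.
bold_text = bold.text                    # 'big'
bold_tail = bold.tail                    # ' world'
break_tail = line_break.tail             # '!'

# get() reads attributes like dict.get(), including an optional default.
css_class = root.get("class")            # 'intro'
missing = root.get("id")                 # None
fallback = root.get("id", "no-id")       # 'no-id'

print(tag_name, child_tags, css_class)
```

Note that an element with no text at all (like the empty br here) has a .text of None rather than an empty string, so it is worth guarding against None when concatenating.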
More sophisticated (and faster) libraries for working with HTML/XML exist, notably lxml, but Python's standard ElementTree implementation is quite capable. Using lxml also requires a separate C module to be compiled for your platform, which can sometimes complicate installation or distribution of your script.
ElementTree supports a small subset of the (more extensive) XPath query language:
Syntax | Meaning |
---|---|
tag | Selects all child elements with the given tag. For example, spam selects all child elements named spam, and spam/egg selects all grandchildren named egg in all children named spam. |
* | Selects all child elements. For example, */egg selects all grandchildren named egg. |
. | Selects the current node. This is mostly useful at the beginning of the path, to indicate that it's a relative path. |
// | Selects all subelements, on all levels beneath the current element. For example, .//egg selects all egg elements in the entire tree. |
.. | Selects the parent element. |
[@attrib] | Selects all elements that have the given attribute. |
[@attrib='value'] | Selects all elements for which the given attribute has the given value. The value cannot contain quotes. |
[tag] | Selects all elements that have a child named tag. Only immediate children are supported. |
[position] | Selects all elements that are located at the given position. The position can be either an integer (1 is the first position), the expression last() (for the last position), or a position relative to the last position (e.g. last()-1). |
Source: http://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax
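To make the table concrete, here is a small sketch that runs several of these paths through findall. The document borrows the spam/egg names from the table, but its structure, the attribute name "kind", and the helper functions texts and tags are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for illustration.
DOC = """
<menu>
  <spam kind="fried"><egg>1</egg><egg>2</egg></spam>
  <spam><egg>3</egg></spam>
  <ham kind="baked"><egg>4</egg><bacon/></ham>
</menu>
"""
root = ET.fromstring(DOC)


def texts(path):
    """Return the .text of every element matched by path (helper for this sketch)."""
    return [el.text for el in root.findall(path)]


def tags(path):
    """Return the .tag of every element matched by path (helper for this sketch)."""
    return [el.tag for el in root.findall(path)]


children_named_spam = tags("spam")            # direct children named spam
eggs_in_spam = texts("spam/egg")              # grandchildren via spam only
eggs_in_any_child = texts("*/egg")            # grandchildren via any child
all_eggs = texts(".//egg")                    # every egg at any depth
with_kind = tags("*[@kind]")                  # children that have a kind attribute
baked = tags("*[@kind='baked']")              # children whose kind is 'baked'
has_bacon = tags("*[bacon]")                  # children with a bacon child
first_eggs = texts("spam/egg[1]")             # first egg inside each spam
last_eggs = texts("spam/egg[last()]")         # last egg inside each spam
bacon_parent = tags(".//bacon/..")            # parent of the bacon element

print(eggs_in_spam, first_eggs, last_eggs)
```

Position predicates are counted per parent, so spam/egg[1] returns one egg for each spam element rather than a single egg overall.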