ElementTree
Revision as of 13:35, 27 May 2014
ElementTree is an interface to XML documents, such as SVG, and to HTML once it has been parsed into a tree with a library such as html5lib.
An element behaves like a mix between a list and a dictionary, and gives you relatively easy access both to read a document (say, to "scrape" data from a webpage) and to write one (say, to alter an SVG or webpage and then re-output it).
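As a rough illustration of the read and write sides, here is a minimal sketch using Python 3's standard xml.etree.ElementTree module. The SVG string and the choice of "green" are made up for the example; a real script would more likely load a file with ET.parse().

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed SVG-like document, invented purely for illustration.
SVG = (
    '<svg width="100" height="100">'
    '<circle r="10" fill="red"/>'
    '<circle r="20" fill="blue"/>'
    '</svg>'
)

root = ET.fromstring(SVG)

# Read: collect the fill colour of every circle in the document.
fills = [circle.get("fill") for circle in root.iter("circle")]

# Write: change an attribute on each circle, then serialise the tree again.
for circle in root.iter("circle"):
    circle.set("fill", "green")

output = ET.tostring(root, encoding="unicode")

print(fills)
print(output)
```

Passing encoding="unicode" makes tostring() return a text string rather than bytes, which is convenient for printing.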
The basic rules:
- Everything is an "Element", which corresponds to an HTML tag like p, div, head, body, etc. A document is represented by its root (outermost) element.
- Elements have a .tag attribute (not a function), which is the name of the tag (like "p" or "script").
- Elements also have a .text attribute, which is the text inside the tag up to the first child element.
- Elements also have a .tail attribute, which is the text immediately following the element's closing tag, before the next tag.
- When used in a loop, an Element acts like a list of its child elements (the tags directly inside it).
- Elements have a .get() function that retrieves named attributes from the tag (like a dictionary's get).
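The rules above can be seen on a small fragment. This is only a sketch, assuming Python 3 and the standard library module; the markup and the example.org URL are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up markup: a paragraph with two child elements and mixed text.
root = ET.fromstring(
    '<p class="intro">Hello <b>big</b> world'
    '<a href="http://example.org/">link</a></p>'
)

print(root.tag)            # name of the tag
print(repr(root.text))     # text up to the first child element
print(root.get("class"))   # attribute lookup, like dict.get
print(root.get("id"))      # missing attributes give None

# Looping over an element visits its direct children, in document order.
for child in root:
    # .tail is the text between this child's closing tag and the next tag.
    print(child.tag, repr(child.text), repr(child.tail))

children = [child.tag for child in root]
```

Note that " world" belongs to the .tail of the b element, not to the .text of p, which is a common surprise when collecting all the text of a tag.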
While more sophisticated (and faster) libraries for working with HTML/XML exist (namely lxml), Python's standard ElementTree implementation is quite capable. Using lxml also requires a separate C module to be compiled on your platform, which can sometimes complicate installation and distribution of your script.
ElementTree supports a small subset of the (more extensive) XPath query language:
Syntax | Meaning |
---|---|
tag | Selects all child elements with the given tag. For example, spam selects all child elements named spam, and spam/egg selects all grandchildren named egg in all children named spam. |
* | Selects all child elements. For example, */egg selects all grandchildren named egg. |
. | Selects the current node. This is mostly useful at the beginning of the path, to indicate that it’s a relative path. |
// | Selects all subelements, on all levels beneath the current element. For example, .//egg selects all egg elements in the entire tree. |
.. | Selects the parent element. |
[@attrib] | Selects all elements that have the given attribute. |
[@attrib='value'] | Selects all elements for which the given attribute has the given value. The value cannot contain quotes. |
[tag] | Selects all elements that have a child named tag. Only immediate children are supported. |
[position] | Selects all elements that are located at the given position. The position can be either an integer (1 is the first position), the expression last() (for the last position), or a position relative to the last position (e.g. last()-1). |
Source: http://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax
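A few of the path expressions from the table, tried with findall() on a made-up document that reuses the spam and egg names. This is a sketch assuming Python 3; the helper function texts() is hypothetical and exists only to keep the example short.

```python
import xml.etree.ElementTree as ET

# Invented sample document reusing the spam/egg names from the table.
DOC = """
<menu>
  <spam id="s1"><egg>one</egg><egg>two</egg></spam>
  <spam><egg>three</egg></spam>
  <ham><egg>four</egg></ham>
</menu>
"""
root = ET.fromstring(DOC)


def texts(path):
    """Hypothetical helper: the .text of every element matching path."""
    return [elem.text for elem in root.findall(path)]


spam_eggs = texts("spam/egg")          # grandchildren named egg under spam
any_eggs = texts("*/egg")              # grandchildren named egg under any child
all_eggs = texts(".//egg")             # every egg, at any depth
first_eggs = texts("spam/egg[1]")      # first egg within each spam
last_eggs = texts("spam/egg[last()]")  # last egg within each spam

with_id = root.findall("spam[@id]")           # spam elements that have an id
id_s1 = root.findall("spam[@id='s1']")        # spam elements with id equal to s1
egg_parents = [e.tag for e in root.findall("*[egg]")]  # children that contain an egg

print(spam_eggs, any_eggs, all_eggs)
print(first_eggs, last_eggs)
print(len(with_id), len(id_s1), egg_parents)
```

Positions in [position] are counted among siblings with the same tag under the same parent, so spam/egg[1] returns one egg per spam element rather than a single result for the whole document.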