ElementTree: Difference between revisions

Revision as of 13:35, 27 May 2014

ElementTree is Python's interface to XML documents, which includes SVG, and also HTML once it has been preprocessed with, say, html5lib.

It behaves like a mix between a list and a dictionary, and gives you relatively easy access for both reading (say, to "scrape" data from a web page) and writing (say, to alter an SVG or web page and then re-output it).
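
As a minimal sketch of that read-then-write cycle, the snippet below parses a small SVG string, reads an attribute, changes it, and serializes the result. The SVG fragment and the variable names are made up for illustration, and the namespace is left out to keep the example short.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up SVG fragment (no namespace, to keep the example short).
SVG = '<svg width="100" height="100"><circle r="40" fill="red"/></svg>'

root = ET.fromstring(SVG)        # parse a string into the root Element

# Read: find the first <circle> child and look up one of its attributes.
circle = root.find("circle")
old_fill = circle.get("fill")

# Write: change the attribute, then serialize the tree back to markup.
circle.set("fill", "blue")
output = ET.tostring(root).decode("utf-8")

print(old_fill)
print(output)
```

To work with files instead of strings, ET.parse(filename) returns a tree whose .getroot() gives the root element, and the tree's .write(filename) saves it again.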

The basic rules:

Everything is an "Element", which corresponds to an HTML tag like p, div, head, body, etc. A document is represented by its root (or outermost) tag.
Elements have a .tag attribute (not a function), which is the name of the tag (like "p" or "script").
Elements also have a .text attribute, which is the text inside the tag (up to the first child element).
Elements also have a .tail attribute, which is the text immediately following the element's closing tag, before the next tag.
Elements, when used in a loop, act like a list of their child elements (the tags inside).
Elements have a .get() method that retrieves named attributes from the tag (like a dictionary).
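
The rules above can be tried out on a small document. This is only an illustrative sketch: the markup and the variable names are invented for the example, and the input is already well-formed (real-world HTML would usually need preprocessing first).

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup used only for this example.
DOC = '<body><p class="intro">Hello <b>big</b> world</p><div id="main"/></body>'

root = ET.fromstring(DOC)        # the document is represented by its root element

root_tag = root.tag              # .tag is a plain attribute, not a function

# Looping over an element yields its child elements, like a list.
child_tags = [child.tag for child in root]

p = root[0]                      # children can also be indexed like a list
b = p[0]

p_text = p.text                  # text inside <p>, up to its first child element
b_text = b.text                  # text inside <b>
b_tail = b.tail                  # text after </b>, before the next tag

p_class = p.get("class")         # attribute lookup, like a dictionary's get()
missing = p.get("id")            # attributes that are absent give None
fallback = p.get("id", "no-id")  # an optional default can be supplied

print(root_tag, child_tags)
print(repr(p_text), repr(b_text), repr(b_tail))
print(p_class, missing, fallback)
```

Note that the text of the paragraph is split across p.text and b.tail, so collecting all the text of an element means visiting both.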

While more sophisticated (and faster) libraries for working with HTML/XML exist (notably lxml), Python's standard ElementTree implementation is quite capable. Using lxml also requires a separate C module to be compiled on your platform, which can sometimes complicate installation and distribution of your script.

ElementTree supports a small subset of the (more extensive) XPath query language:

Syntax	Meaning
`tag`	Selects all child elements with the given tag. For example, `spam` selects all child elements named `spam`, and `spam/egg` selects all grandchildren named `egg` in all children named `spam`.
`*`	Selects all child elements. For example, `*/egg` selects all grandchildren named `egg`.
`.`	Selects the current node. This is mostly useful at the beginning of the path, to indicate that it’s a relative path.
`//`	Selects all subelements, on all levels beneath the current element. For example, `.//egg` selects all `egg` elements in the entire tree.
`..`	Selects the parent element.
`[@attrib]`	Selects all elements that have the given attribute.
`[@attrib='value']`	Selects all elements for which the given attribute has the given value. The value cannot contain quotes.
`[tag]`	Selects all elements that have a child named `tag`. Only immediate children are supported.
`[position]`	Selects all elements that are located at the given position. The position can be either an integer (1 is the first position), the expression `last()` (for the last position), or a position relative to the last position (e.g. `last()-1`).

Source: http://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax
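
The path expressions in the table are passed to an element's find() and findall() methods. The sketch below runs several of them against a small made-up document; the element names (menu, spam, egg, ham) and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up document reusing the spam/egg names from the table.
DOC = """<menu>
  <spam id="a"><egg>1</egg><egg>2</egg></spam>
  <spam><egg>3</egg></spam>
  <ham kind="smoked"><egg>4</egg></ham>
</menu>"""

root = ET.fromstring(DOC)

# tag: direct children named spam
spam_children = root.findall("spam")

# tag/tag: egg grandchildren found inside spam children
spam_eggs = [e.text for e in root.findall("spam/egg")]

# */tag: egg grandchildren under any child
grand_eggs = [e.text for e in root.findall("*/egg")]

# .//tag: egg elements at any depth below the current element
all_eggs = [e.text for e in root.findall(".//egg")]

# [@attrib] and [@attrib='value']: filter on attributes
with_id = [e.get("id") for e in root.findall("spam[@id]")]
smoked = [e.tag for e in root.findall("*[@kind='smoked']")]

# [tag]: children that themselves have an egg child
has_egg = [e.tag for e in root.findall("*[egg]")]

# [position]: eggs that are the first egg within their own parent
first_eggs = [e.text for e in root.findall("spam/egg[1]")]

# ..: parents of every egg element
egg_parents = [e.tag for e in root.findall(".//egg/..")]

print(spam_eggs, grand_eggs, all_eggs)
print(with_id, smoked, has_egg, first_eggs, egg_parents)
```

find() takes the same expressions but returns only the first match, or None if nothing matches.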

@@ Line 5: / Line 5: @@
 The basic rules:
-# Everything is an "Element" which corresponds to an html tag like p, div, head, body, etc
+# Everything is an "Element" which corresponds to an html tag like p, div, head, body, etc; A document is represented by it's root (or outermost) tag
-# Element's have a ".tag" attribute (not a function) which is the name of of the tag (like "p" or "script")
+# Element's have a '''.tag''' attribute (not a function) which is the name of of the tag (like "p" or "script")
-# Elements also have a ".text" which is the text inside the tag (up to the first child element)
+# Elements also have a '''.text''' which is the text inside the tag (up to the first child element)
-# Elements also have a ".tail" which is the text immediately following the tag, before the next tag.
+# Elements also have a '''.tail''' which is the text immediately following the tag, before the next tag.
 # Elements when used in a loop, act like a list of their child elements (the tags inside)
-# Elements have a get function that retrieves named attributes from the tag (like a dictionary)
+# Elements have a '''.get()''' function that retrieves named attributes from the tag (like a dictionary)
 While more sophisticated (and faster) libraries for working with HTML/XML exist (namely lxml), python's standard ElementTree implementation is quite capable (and using lxml requires a separate C module to be compiled on your platform which can sometimes complicate installation / distribution of your script).