ElementTree: Difference between revisions

From XPUB & Lens-Based wiki

Revision as of 14:35, 27 May 2014

ElementTree is an interface to XML documents, such as SVG, and to HTML once it has been preprocessed with a parser such as html5lib.

It behaves like a mix between a list and a dictionary, and gives you relatively easy access for both reading (say, to "scrape" data from a web page) and writing (say, to alter an SVG or web page and then re-output it).
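
As a rough illustration of that read-then-write cycle, here is a minimal sketch assuming Python 3. The SVG-like fragment is made up for the example and deliberately carries no XML namespace, which a real SVG file would have.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration (no namespace, unlike a real SVG file).
svg = '<svg width="100" height="100"><circle cx="50" cy="50" r="40" fill="red" /></svg>'

root = ET.fromstring(svg)          # parse the text into a tree of Elements
circle = root.find("circle")       # first child element named "circle"

radius = circle.get("r")           # read an attribute (always a string)
circle.set("fill", "blue")         # write an attribute

# Serialize the altered tree back to text.
output = ET.tostring(root, encoding="unicode")
print(radius)
print(output)
```

Attribute values come back as strings, so a number such as the radius needs converting with int() or float() before doing arithmetic on it.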

The basic rules:

  1. Everything is an "Element", which corresponds to an HTML tag like p, div, head, body, etc. A document is represented by its root (or outermost) tag.
  2. Elements have a .tag attribute (not a function), which is the name of the tag (like "p" or "script").
  3. Elements also have a .text attribute, which is the text inside the tag (up to the first child element).
  4. Elements also have a .tail attribute, which is the text immediately following the tag, before the next tag.
  5. Elements, when used in a loop, act like a list of their child elements (the tags inside).
  6. Elements have a .get() function that retrieves named attributes from the tag (like a dictionary).
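
The rules above can be tried out on a small document. This is only a sketch: the markup and the variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document used only for illustration.
source = '<div class="intro"><p id="first">Hello <b>bold</b> world</p><p>Second</p></div>'

root = ET.fromstring(source)   # rule 1: the document is its root element

print(root.tag)                # rule 2: .tag is an attribute, so no parentheses

first = root[0]                # children can be indexed like a list
print(repr(first.text))        # rule 3: text up to the first child element

bold = first[0]
print(repr(bold.tail))         # rule 4: text after the closing </b>, before the next tag

for child in root:             # rule 5: looping visits the child elements
    # rule 6: .get() returns None when the attribute is missing
    print(child.tag, child.get("id"))
```

Note that the text " world" belongs to the .tail of the b element rather than to the .text of the enclosing p, which is a common source of surprise when scraping.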

More sophisticated (and faster) libraries for working with HTML/XML exist, notably lxml, but Python's standard ElementTree implementation is quite capable. lxml also depends on a separately compiled C module, which can sometimes complicate installation and distribution of your script.

ElementTree supports a small subset of the (more extensive) XPath query language:

Syntax               Meaning
tag                  Selects all child elements with the given tag. For example, spam selects all child elements named spam, and spam/egg selects all grandchildren named egg in all children named spam.
*                    Selects all child elements. For example, */egg selects all grandchildren named egg.
.                    Selects the current node. This is mostly useful at the beginning of the path, to indicate that it’s a relative path.
//                   Selects all subelements, on all levels beneath the current element. For example, .//egg selects all egg elements in the entire tree.
..                   Selects the parent element.
[@attrib]            Selects all elements that have the given attribute.
[@attrib='value']    Selects all elements for which the given attribute has the given value. The value cannot contain quotes.
[tag]                Selects all elements that have a child named tag. Only immediate children are supported.
[position]           Selects all elements that are located at the given position. The position can be either an integer (1 is the first position), the expression last() (for the last position), or a position relative to the last position (e.g. last()-1).

Source: http://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax
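
To see several of these expressions in action, here is a short sketch using findall() on an invented document. The element names follow the spam/egg examples in the table, while the ham element and the kind attribute are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Invented document; "ham" and the "kind" attribute exist only for this example.
doc = ET.fromstring(
    '<menu>'
    '<spam kind="fried"><egg>one</egg><egg>two</egg></spam>'
    '<spam><egg>three</egg></spam>'
    '<ham><egg>four</egg></ham>'
    '</menu>'
)

spam = doc.findall("spam")                               # tag: children named spam
spam_eggs = [e.text for e in doc.findall("spam/egg")]    # eggs inside spam children
any_eggs = [e.text for e in doc.findall("*/egg")]        # eggs inside any child
all_eggs = doc.findall(".//egg")                         # eggs at every level
with_kind = doc.findall("spam[@kind]")                   # spam that has a kind attribute
fried = doc.findall("spam[@kind='fried']")               # spam whose kind is "fried"
has_egg = [e.tag for e in doc.findall("*[egg]")]         # children with an egg child
first_eggs = [e.text for e in doc.findall("spam/egg[1]")]       # first egg in each spam
last_eggs = [e.text for e in doc.findall("spam/egg[last()]")]   # last egg in each spam

print(spam_eggs, any_eggs, has_egg, first_eggs, last_eggs)
```

Positions are counted within each parent, so egg[1] and egg[last()] return one egg per spam element rather than a single egg for the whole document.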