ElementTree

From XPUB & Lens-Based wiki
Latest revision as of 15:56, 6 October 2014

ElementTree is an interface to XML documents (such as SVG, and HTML when preprocessed with, say, html5lib).

ET lets you treat documents like a mix between a list and a dictionary, and gives relatively easy access to both reading (say, to "scrape" data from webpages) and writing (say, to alter an SVG or fuse together multiple webpages and then resave them as text).

The basic rules:

  1. Everything is an "Element", which corresponds to an HTML tag like p, div, head, body, etc. A document is represented by its root (or outermost) element.
  2. Elements have a .tag attribute (not a function), which is the name of the tag (like "p" or "script").
  3. Elements also have a .text, which is the text inside the tag (up to the first child element).
  4. Elements also have a .tail, which is the text immediately following the tag, before the next tag.
  5. Elements, when used in a loop, act like a list of their child elements (the tags inside).
  6. Elements have a .get() function that retrieves named attributes from the tag (like a dictionary).
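
The rules above can be tried out on a small fragment. This is only a sketch: the markup and the variable names are made up for illustration.

```python
from xml.etree import ElementTree as ET

# A made-up fragment, just for illustration
root = ET.fromstring(
    '<div id="intro"><p class="lead">Hello <b>big</b> world</p><p>Bye</p></div>'
)

print(root.tag)                # rule 2: the tag name of the root element
print(root.get("id"))          # rule 6: attribute lookup, like a dictionary

first = root[0]                # children can also be indexed like a list
bold = first[0]
print(repr(first.text))        # rule 3: text up to the first child element
print(repr(bold.tail))         # rule 4: text after </b>, before the next tag

child_tags = [child.tag for child in root]   # rule 5: looping yields the children
print(child_tags)
```

Note that .get() returns None for an attribute that is not present, so root.get("missing") does not raise an error.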

While more sophisticated (and faster) libraries for working with HTML/XML exist (namely lxml), Python's standard ElementTree implementation is quite capable. Using lxml also requires a separate C module to be compiled on your platform, which can sometimes complicate installation and distribution of your script.

Reading some text

The .fromstring() function converts well-formed XML text (including HTML that happens to be well-formed XML) into an Element, the root of a tree.

from xml.etree import ElementTree as ET
t = ET.fromstring("<p>Hello world!</p>")
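
The standard parser is strict about XML syntax, so markup with unclosed tags is rejected. A small sketch of what that looks like; try_parse is a hypothetical helper name, not part of ElementTree.

```python
from xml.etree import ElementTree as ET

def try_parse(text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None

ok = try_parse("<p>Hello world!</p>")
bad = try_parse("<p>Hello<br>world</p>")   # <br> is never closed

print(ok.text)
print(bad)
```

This strictness is the reason real-world HTML usually needs a forgiving parser in front of ElementTree.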

Reading HTML from "the wild"

Use html5lib to read HTML pages as element trees.

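import html5lib
from urllib.request import urlopen

f = urlopen("http://en.wikipedia.org/wiki/Gunny_sack")
t = html5lib.parse(f, namespaceHTMLElements=False, treebuilder="etree")
# find() returns the first matching element (or None if there is none)
h1 = t.find(".//h1")
# join all the text inside the heading, including text in nested tags
print("".join(h1.itertext()))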

.iter()

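The .iter() method "walks" over all the elements of a tree: the element it is called on first, then every element beneath it, in document order. Passing a tag name limits the walk to elements with that tag.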
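
A minimal sketch of such a walk, using a made-up list fragment:

```python
from xml.etree import ElementTree as ET

root = ET.fromstring("<ul><li>one</li><li>two <b>bold</b></li></ul>")

# Walk every element, starting with the root itself
all_tags = [elt.tag for elt in root.iter()]
print(all_tags)

# Passing a tag name limits the walk to matching elements
items = [li.text for li in root.iter("li")]
print(items)
```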

.find() and .findall()

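ElementTree supports a small subset of the (more extensive) XPath query language via the .find() and .findall() functions. .find() returns the first matching element (or None), and .findall() returns a list of all matches.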

Syntax               Meaning
tag                  Selects all child elements with the given tag. For example, spam selects all child elements named spam, and spam/egg selects all grandchildren named egg in all children named spam.
*                    Selects all child elements. For example, */egg selects all grandchildren named egg.
.                    Selects the current node. This is mostly useful at the beginning of the path, to indicate that it’s a relative path.
//                   Selects all subelements, on all levels beneath the current element. For example, .//egg selects all egg elements in the entire tree.
..                   Selects the parent element.
[@attrib]            Selects all elements that have the given attribute.
[@attrib='value']    Selects all elements for which the given attribute has the given value. The value cannot contain quotes.
[tag]                Selects all elements that have a child named tag. Only immediate children are supported.
[position]           Selects all elements that are located at the given position. The position can be either an integer (1 is the first position), the expression last() (for the last position), or a position relative to the last position (e.g. last()-1).

Source: http://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax
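
A few of the expressions from the table, tried on a made-up fragment. This is a sketch for orientation; the markup and variable names are illustrative only.

```python
from xml.etree import ElementTree as ET

root = ET.fromstring(
    '<body>'
    '<div class="menu"><a href="/home">Home</a><a href="/about">About</a></div>'
    '<div><a name="top">Top</a><img src="logo.png"/></div>'
    '</body>'
)

first_link = root.find(".//a")               # first <a> anywhere below root
links = root.findall(".//a[@href]")          # every <a> that has an href attribute
menu = root.findall("div[@class='menu']")    # child divs with a given attribute value
with_img = root.findall(".//div[img]")       # divs that have an <img> child
missing = root.find(".//h1")                 # no match: find() returns None

print(first_link.text)
print([a.get("href") for a in links])
print(len(menu), len(with_img), missing)
```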

... and back to text

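ET.tostring() serializes an element back to text. With method="text" only the text content is kept; without it you get the full markup.

from xml.etree import ElementTree as ET

# x is any element; a small example is built here so the snippet runs on its own
x = ET.fromstring("<p>Hello <b>world</b>!</p>")
# text content only, with the tags stripped
print(ET.tostring(x, method="text", encoding="unicode"))
# full markup, tags included
print(ET.tostring(x, encoding="unicode"))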
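
The same serialization step is what makes writing possible: change the tree, then turn it back into text. A minimal sketch with a made-up, namespace-free SVG fragment (real SVG files declare an xmlns, which adds a namespace prefix to every tag name):

```python
from xml.etree import ElementTree as ET

svg = ET.fromstring('<svg width="100" height="100"><circle r="10" fill="red"/></svg>')

circle = svg.find("circle")
circle.set("fill", "blue")            # change an existing attribute
circle.set("r", "20")

label = ET.SubElement(svg, "text")    # append a new child element
label.text = "hi"

markup = ET.tostring(svg, encoding="unicode")
print(markup)
```

The resulting string can then be written to a file with the usual open(...).write(...).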

Getting at the parent

A common limitation of ElementTree is that elements don't themselves have a link to their parent element. This can cause problems when, for instance, trying to remove elements. A simple fix is to use an iterator that gives access...

Source: http://effbot.org/zone/element.htm

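def iterparent(tree):
    # yield a (parent, child) pair for every element below the root
    # (.iter() replaces the older .getiterator(), which newer Pythons no longer have)
    for parent in tree.iter():
        for child in parent:
            yield parent, child

# tree is an element (or ElementTree) parsed earlier
for parent, child in iterparent(tree):
    ...  # work on the parent/child pair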
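
As a sketch of the removal case mentioned above, here is one way to strip all script elements with the help of iterparent. The fragment is made up, and collecting the pairs into a list first is a precaution chosen here, since removing children while the tree is still being walked can skip elements.

```python
from xml.etree import ElementTree as ET

def iterparent(tree):
    # yield a (parent, child) pair for every element below the root
    for parent in tree.iter():
        for child in parent:
            yield parent, child

root = ET.fromstring(
    "<div><p>keep</p><script>x()</script><p>also <script>y()</script>keep</p></div>"
)

# Collect first, remove afterwards
doomed = [(parent, child) for parent, child in iterparent(root) if child.tag == "script"]
for parent, child in doomed:
    parent.remove(child)

remaining = [elt.tag for elt in root.iter()]
print(remaining)
```

Be aware that an element's .tail is removed along with it, so the text that followed the second script tag is lost here; copy the tail elsewhere first if it matters.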

Absolutizing hrefs and srcs (incomplete... still buggy)

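import html5lib
import requests  # assumption: r in the original was a requests response
from urllib.parse import urljoin, urlparse

base = "http://en.wikipedia.org/wiki/Gunny_sack"  # example page URL (assumption)
r = requests.get(base)
t = html5lib.parse(r.text, treebuilder="etree", namespaceHTMLElements=False)

def myurljoin(base, href):
    # give protocol-relative links ("//host/path") the scheme of the base URL
    if href.startswith("//"):
        href = urlparse(base).scheme + ":" + href
    return urljoin(base, href)

# rewrite every href attribute as an absolute URL
for link in t.findall(".//*[@href]"):
    href = link.get("href")
    if href is not None:
        link.set("href", myurljoin(base, href))

# do the same for every src attribute
for elt in t.findall(".//*[@src]"):
    src = elt.get("src")
    if src is not None:
        elt.set("src", myurljoin(base, src))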