PythonHtml5lib

From XPUB & Lens-Based wiki
Revision as of 20:32, 23 September 2010 by Migratebot (talk | contribs) (Created page with "HTML5lib is a good Python library for working with web pages, especially pages "in the wild" where small errors / missing pieces in a web page may through off other libraries (su...")

HTML5lib is a good Python library for working with web pages, especially pages "in the wild" where small errors or missing pieces may throw off other libraries (such as a strict XML parser). In fact, you could use the library to "repair" broken pages.
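As a rough illustration of the "repair" idea, the sketch below parses a fragment with unclosed tags and writes it back out. It assumes an html5lib release that exposes the top-level parse and serialize helpers (older releases may differ), and the sample string is made up for the example.

```python
import html5lib

# Hypothetical broken input: no closing tags, no <html>, <head> or <body>
broken = "<p>Hello <b>world"

# Parse into a DOM tree; the parser supplies the missing pieces
tree = html5lib.parse(broken, treebuilder="dom")

# Write the tree back out, keeping the tags that HTML would allow us to omit
repaired = html5lib.serialize(tree, tree="dom", omit_optional_tags=False)
print(repaired)
```

The printed markup should have the b and p elements properly closed and wrapped in html and body elements.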

Scraping a list of images out of a web page

The code

Note that we make use of the openURL function from PythonOpenURL.

import urllib2, html5lib, urlparse

def openURL (url):
    """
    returns (page, url)
    sets user_agent and resolves possible redirection
    returned url may be different than initial url in the case of a redirect
    """    
    request = urllib2.Request(url)
    user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
    request.add_header("User-Agent", user_agent)
    pagefile=urllib2.urlopen(request)
    realurl = pagefile.geturl()
    return (pagefile, realurl)

def getImagesFromWeb (url):
    """
    returns: a list of absolute URLs of the src's found in <img> tags at the given URL
    requires: URL should be of an HTML page
    """
    parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))
    (f, url) = openURL(url)
    tree = parser.parse(f)
    tree.normalize()
    ret = []    
    for img in tree.getElementsByTagName("img"):
        src = img.getAttribute("src")
        if not src.startswith("http://"):
            src = urlparse.urljoin(url, src)
        ret.append(src)
    f.close()
    return ret


Usage:

bbcimages = getImagesFromWeb("http://news.bbc.co.uk")


And then wrap them in image tags with something like:

for img in bbcimages:
    print "<img src='" + img + "' />"


Or using list comprehensions (for those inclined to "one-liners"):

print "\n".join(["<img src='%s' />" % src for src in bbcimages])



Discussion

Using html5lib, we can create a "dom tree" from a web page and then use the function getElementsByTagName to get a list of a particular kind of HTML tag, such as "img" for image tags.

To begin we import the library, and create a "parser" -- in effect a machine that will chew the text of a webpage into a nicely structured DOM tree:

    parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))


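We then open the page with openURL and hand the resulting file to the parser (tree = parser.parse(f)). The parser does the real repair work: it builds a complete tree even when the markup is broken. Finally we "normalize" the result -- a small DOM tidy-up that merges adjacent text nodes, so that the tree is structured more consistently and is easier to work with whatever the source.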

    tree.normalize()
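To see what normalize does on its own, here is a minimal sketch using xml.dom.minidom from the standard library, on the assumption that the "dom" tree builder hands back a minidom document. The tree is built by hand here, and the element and text are invented for the example.

```python
from xml.dom import minidom

doc = minidom.Document()
p = doc.createElement("p")
doc.appendChild(p)

# Two separate text nodes side by side, as can happen when a tree is built piecemeal
p.appendChild(doc.createTextNode("Hello, "))
p.appendChild(doc.createTextNode("world"))
before = len(p.childNodes)

# normalize() merges adjacent text nodes throughout the tree
doc.normalize()
after = len(p.childNodes)
merged_text = p.firstChild.data

print("%d -> %d: %s" % (before, after, merged_text))
```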


Next we make an empty list to collect the results we will return. We call getElementsByTagName, probably the second most popular DOM function of all time (after getElementById, of course), loop over the results, and use getAttribute to extract the "src" attribute (the URL) from each image tag.

    ret = []    
    for img in tree.getElementsByTagName("img"):
        src = img.getAttribute("src")
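The two DOM calls can be tried in isolation with xml.dom.minidom from the standard library. This is only a sketch: the snippet is invented and has to be well-formed, because minidom's own parser is a strict XML parser (which is exactly the limitation html5lib works around).

```python
from xml.dom import minidom

# Hypothetical, well-formed markup; the last image has no src attribute
snippet = '<div><img src="a.png"/><p><img src="/b/c.jpg"/></p><img alt="no source"/></div>'
tree = minidom.parseString(snippet)

# getElementsByTagName finds matching elements at any depth, in document order
imgs = tree.getElementsByTagName("img")

# getAttribute returns an empty string when the attribute is missing
srcs = [img.getAttribute("src") for img in imgs]
print(srcs)
```

Note the empty string for the image without a src: a caller may want to skip such entries before resolving URLs.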

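We check if the URL is absolute (does it start with "http://"?) -- if not, we use Python's urljoin function to make an absolute URL based on the page URL. Then we add it to the list. URLs that start with "https://" also pass through urljoin, which is harmless: urljoin returns an already absolute URL unchanged.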

        if not src.startswith("http://"):
            src = urlparse.urljoin(url, src)
        ret.append(src)
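A short sketch of how urljoin resolves different kinds of src values against a page URL. The URLs are made up, and the import falls back to Python 2's urlparse module (as used in the code above) when urllib.parse is not available.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

# Hypothetical page URL and src values
page = "http://example.org/news/index.html"
sources = [
    "logo.png",                          # relative to the page's directory
    "/static/logo.png",                  # relative to the site root
    "../img/logo.png",                   # one directory up
    "https://cdn.example.org/logo.png",  # already absolute
    "//cdn.example.org/logo.png",        # protocol-relative
]

resolved = [urljoin(page, src) for src in sources]
for src, full in zip(sources, resolved):
    print("%s -> %s" % (src, full))
```

Since urljoin copes with all of these cases, the startswith check is a shortcut rather than a necessity.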


Finally we clean up (close the file) and return the list of image URLs.

    f.close()
    return ret


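The same approach works on HTML you already have in hand (a string or an open file) instead of a URL. Pass the page's URL as the second argument so that relative src's can be resolved:

def getImagesFromHTML (html, url=""):
    """
    returns: a list of the src's found in <img> tags in the given HTML,
             made absolute against url (left as they are if url is "")
    requires: html should be an HTML string or an open file
    """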
    parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))

    tree = parser.parse(html)
    tree.normalize()
    ret = []    
    for img in tree.getElementsByTagName("img"):
        src = img.getAttribute("src")
        if not src.startswith("http://"):
            src = urlparse.urljoin(url, src)
        ret.append(src)
    return ret