Syllabus 20100126

Last week when processing RSS feeds we ran into some issues translating stuff into perfect XML that would be accepted as an ePub. Turns out there are quite a few tools to help do this work in Python. More interestingly, these same tools are designed to help "scape" pages from the wild (ie that maybe haven't been designed to be read as a feed or via a special API).

XML Parsing in Python

There are some (potentially) useful modules built-in to Python to work with XML.

Example using built-in expat module.

Example using the minidom parser.

To deal with potentially messy "real-world" cases, however, it usually makes sense to use extra libraries designed to be more tolerant and capable of correcting small mistakes in structure.

html5lib
BeautifulSoup
An example of spidering a webpage to find and download images.

This week's code:

import feedparser
import urllib2, urlparse, os, sys
import html5lib


def openURL (url):
    """
    returns (page, actualurl)
    sets user_agent and resolves possible redirection
    realurl maybe different than url in the case of a redirect
    """    
    request = urllib2.Request(url)
    user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
    request.add_header("User-Agent", user_agent)
    pagefile=urllib2.urlopen(request)
    realurl = pagefile.geturl()
    return (pagefile, realurl)


(f,url)= openURL ("http://www.nrcnext.nl/blog/2010/01/26/verdubbeling-aantal-verkochte-iphones/")

parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))
tree = parser.parse(f)
f.close()
tree.normalize()
for node in tree.getElementsByTagName("img"):
    src = node.getAttribute("src")
    print src

sys.exit()

newwork = feedparser.parse("http://feeds.nrcnext.nl/nrcnext-blog")

#import pprint
#pprint.pprint(newwork.entries[0])
#import sys
#sys.exit()

print """
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>feedbook</title>
    <link type="text/css" rel="stylesheet" media="all" href="stylesheet.css" />
  </head>
  <body>"""

for e in newwork.entries:
    print e["title"].replace("&", "&amp;").encode("utf-8")
    print e["link"]
print"""
  </body>
</html>"""