Syllabus 20100126
Last week when processing RSS feeds we ran into some issues translating stuff into perfect XML that would be accepted as an ePub. Turns out there are quite a few tools to help do this work in Python. More interestingly, these same tools are designed to help "scape" pages from the wild (ie that maybe haven't been designed to be read as a feed or via a special API).
XML Parsing in Python
There are some (potentially) useful modules built-in to Python to work with XML.
Example using built-in expat module.
Example using the minidom parser.
To deal with potentially messy "real-world" cases, however, it usually makes sense to use extra libraries designed to be more tolerant and capable of correcting small mistakes in structure.
- html5lib
- BeautifulSoup
- An example of spidering a webpage to find and download images.
This week's code:
import feedparser
import urllib2, urlparse, os, sys
import html5lib
def openURL (url):
"""
returns (page, actualurl)
sets user_agent and resolves possible redirection
realurl maybe different than url in the case of a redirect
"""
request = urllib2.Request(url)
user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
request.add_header("User-Agent", user_agent)
pagefile=urllib2.urlopen(request)
realurl = pagefile.geturl()
return (pagefile, realurl)
(f,url)= openURL ("http://www.nrcnext.nl/blog/2010/01/26/verdubbeling-aantal-verkochte-iphones/")
parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))
tree = parser.parse(f)
f.close()
tree.normalize()
for node in tree.getElementsByTagName("img"):
src = node.getAttribute("src")
print src
sys.exit()
newwork = feedparser.parse("http://feeds.nrcnext.nl/nrcnext-blog")
#import pprint
#pprint.pprint(newwork.entries[0])
#import sys
#sys.exit()
print """
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>feedbook</title>
<link type="text/css" rel="stylesheet" media="all" href="stylesheet.css" />
</head>
<body>"""
for e in newwork.entries:
print e["title"].replace("&", "&").encode("utf-8")
print e["link"]
print"""
</body>
</html>"""