PythonOpenURL
The following function shows how to use Python to load the source of a web page.
import urllib2

def openURL(url, user_agent="Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"):
    """
    Returns: tuple with (file, realurl)
    Sets the User-Agent header and follows redirects if necessary.
    realurl may differ from url in the case of a redirect.
    """
    request = urllib2.Request(url)
    if user_agent:
        # Some sites refuse requests that lack a browser-like User-Agent
        request.add_header("User-Agent", user_agent)
    pagefile = urllib2.urlopen(request)
    # geturl() gives the final URL after any redirects
    realurl = pagefile.geturl()
    return (pagefile, realurl)
For example, to load the CNN homepage:
(f, url) = openURL("http://www.cnn.com")
The returned object (named f above) is file-like: it behaves like a normal Python file object. To work with a library such as html5lib, pass this object to the library's parse function (see PythonHtml5lib).
To read the contents of the page as a string, call the file object's read method:
pagecontents = f.read()
print pagecontents
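The code above targets Python 2. In Python 3, urllib2 was merged into urllib.request, and read() returns bytes rather than a string. What follows is a minimal sketch of an equivalent under those assumptions, not a definitive port. The names open_url, read_text, DEFAULT_USER_AGENT and the utf-8 fallback are illustrative choices and do not come from the original function.

```python
import urllib.request

# Hypothetical default; any browser-like User-Agent string would do here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def open_url(url, user_agent=DEFAULT_USER_AGENT):
    """Return (response, real_url).

    Sets the User-Agent header if one is given. real_url may differ
    from url when the server redirects the request.
    """
    request = urllib.request.Request(url)
    if user_agent:
        request.add_header("User-Agent", user_agent)
    response = urllib.request.urlopen(request)
    return response, response.geturl()


def read_text(response, fallback_encoding="utf-8"):
    """Read the whole body and decode it to str.

    Uses the charset from the Content-Type header when present;
    the utf-8 fallback is an assumption, not something the server declares.
    """
    charset = response.headers.get_content_charset() or fallback_encoding
    return response.read().decode(charset, errors="replace")
```

With these definitions, the CNN example becomes response, url = open_url("http://www.cnn.com") followed by print(read_text(response)). Decoding is kept in a separate helper so that the raw response can still be handed to a parser that expects a file-like object.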