The following function shows how to use Python to load the source of a web page.
import urllib2

def openURL(url,
            user_agent="Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:188.8.131.52) "
                       "Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"):
    """
    Returns: tuple with (file, actualurl)
    Sets the User-Agent header and follows redirection if necessary.
    realurl may be different from url in the case of a redirect.
    """
    request = urllib2.Request(url)
    if user_agent:
        # Some sites serve different content (or refuse) without a browser-like agent
        request.add_header("User-Agent", user_agent)
    pagefile = urllib2.urlopen(request)
    # geturl() reports the final URL after any redirects
    realurl = pagefile.geturl()
    return (pagefile, realurl)
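The urllib2 module exists only in Python 2; in Python 3 the same functionality lives in urllib.request. As a rough sketch of how the function above might be ported (the function name is kept so the usage below still applies, and the default user agent string here is a placeholder assumption, not a value from this page):

```python
import urllib.request

# Placeholder user agent, assumed for illustration only
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher)"

def openURL(url, user_agent=DEFAULT_USER_AGENT):
    """Return a tuple (file, actualurl); actualurl may differ from url after a redirect."""
    request = urllib.request.Request(url)
    if user_agent:
        request.add_header("User-Agent", user_agent)
    # urlopen follows HTTP redirects by default
    pagefile = urllib.request.urlopen(request)
    # geturl() reports the final URL after any redirects
    return (pagefile, pagefile.geturl())

# Usage (requires network access):
# (f, url) = openURL("http://www.cnn.com")
```

The returned object is still file-like, but under Python 3 its read method returns bytes rather than a str.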
For example, to use the function to load the CNN homepage:
(f, url) = openURL("http://www.cnn.com")
The returned object (named f above) is a file-like object; that is, it behaves like a normal Python file object. To work with a library like html5lib, you would pass this object to the library's parse function (see PythonHtml5lib).
To read the contents of the page as a string, call the file object's read method:
pagecontents = f.read()
print pagecontents
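The snippet above uses the Python 2 print statement and gets a byte string back from read. Under Python 3, read returns bytes that need decoding before they can be treated as text. A minimal sketch, assuming the response came from urllib.request.urlopen (read_page_text and fallback_encoding are hypothetical names introduced here for illustration):

```python
import urllib.request

def read_page_text(pagefile, fallback_encoding="utf-8"):
    """Read a urlopen() response and decode it to str."""
    raw = pagefile.read()
    # Use the charset from the Content-Type header if the server sent one
    headers = getattr(pagefile, "headers", None)
    charset = headers.get_content_charset() if headers is not None else None
    # errors="replace" avoids an exception on pages with mislabeled encodings
    return raw.decode(charset or fallback_encoding, errors="replace")

# Usage (requires network access):
# f = urllib.request.urlopen("http://www.cnn.com")
# print(read_page_text(f))
```

Falling back to UTF-8 is a guess rather than a guarantee; pages that declare their encoding only inside the HTML itself are better handled by a parser such as html5lib.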