PythonOpenURL

From XPUB & Lens-Based wiki

The following function shows how to use Python 2's urllib2 module to load the source of a web page.

import urllib2

def openURL(url, user_agent="Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"):
    """
    Returns: tuple with (pagefile, realurl)
    Sets the User-Agent header and follows redirects if necessary.
    realurl may be different from url in the case of a redirect.
    """
    request = urllib2.Request(url)
    # Some servers reject or alter responses for the default Python user agent
    if user_agent:
        request.add_header("User-Agent", user_agent)
    # urlopen follows HTTP redirects automatically
    pagefile = urllib2.urlopen(request)
    # geturl() reports the URL actually retrieved, after any redirect
    realurl = pagefile.geturl()
    return (pagefile, realurl)
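
The urllib2 module exists only in Python 2. In Python 3 the same functionality lives in urllib.request, so a rough equivalent of the function above might look like the sketch below. The name open_url is an assumption made for this example (it is not part of the original page); the user agent string is the one used above.

```python
import urllib.request

# Same user agent string as the Python 2 version above.
USER_AGENT = ("Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) "
              "Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5")


def open_url(url, user_agent=USER_AGENT):
    """
    Hypothetical Python 3 port of openURL.
    Returns a tuple (response, real_url); real_url may differ from url
    if the server redirected the request.
    """
    request = urllib.request.Request(url)
    if user_agent:
        request.add_header("User-Agent", user_agent)
    # urlopen follows HTTP redirects automatically
    response = urllib.request.urlopen(request)
    return (response, response.geturl())
```

It would be called the same way, for example (f, url) = open_url("http://www.cnn.com"). Note that in Python 3 the returned object's read method gives bytes rather than a string.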


For example, to use the function to load the CNN homepage:

(f, url) = openURL("http://www.cnn.com")


The returned object (named f above) is a file-like object; that is, it behaves much like a normal Python file object. To work with a library like html5lib, you would pass this object to the library's parse function (see PythonHtml5lib).

To read the contents of the page as a string, call the file object's read method.

pagecontents = f.read()
print pagecontents
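
Under Python 3, read returns bytes, so the contents need to be decoded before they can be treated as text. The following is a minimal sketch, assuming Python 3 and urllib.request; the helper name read_text and the UTF-8 fallback are choices made for this example, not something the original page prescribes.

```python
import urllib.request


def read_text(response, fallback_encoding="utf-8"):
    """
    Read a urllib response completely and decode it to a str.
    Uses the charset declared in the Content-Type header when present,
    otherwise falls back to fallback_encoding (an assumption).
    """
    raw = response.read()
    charset = response.headers.get_content_charset() or fallback_encoding
    # errors="replace" avoids an exception on bytes that do not decode cleanly
    return raw.decode(charset, errors="replace")


# Usage (needs network access):
# response = urllib.request.urlopen("http://www.cnn.com")
# print(read_text(response))
```

Pages that declare their encoding only in an HTML meta tag are not handled by this sketch; a parser such as html5lib can detect that case itself when given the raw file-like object.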