PythonOpenURL

From XPUB & Lens-Based wiki

The following function shows how to use Python to load the source of a web page.

try:
    # Python 3
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2
    from urllib2 import Request, urlopen

def openURL(url, user_agent="Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"):
    """
    Returns: tuple with (pagefile, realurl)
    Sets the User-Agent header & follows redirection if necessary.
    realurl may be different from url in the case of a redirect.
    """
    request = Request(url)
    if user_agent:
        # Send a browser-like User-Agent header with the request
        request.add_header("User-Agent", user_agent)
    pagefile = urlopen(request)
    # geturl() reports the final URL, after any redirects
    realurl = pagefile.geturl()
    return (pagefile, realurl)


To use the function, for instance to load the CNN homepage:

(f, url) = openURL("http://www.cnn.com")
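
Opening a URL can fail, for example when the address is malformed or the server cannot be reached. The following is a minimal sketch of one way to guard the call, assuming Python 3. The helper name try_open and the 10-second timeout are hypothetical choices for illustration, and the sketch calls urllib.request.urlopen directly so that it runs on its own.

```python
import urllib.error
import urllib.request

def try_open(url, timeout=10):
    """Return the opened page, or None if the URL cannot be opened."""
    try:
        return urllib.request.urlopen(url, timeout=timeout)
    except (urllib.error.URLError, ValueError) as err:
        # URLError covers network and HTTP failures;
        # ValueError is raised for strings that are not valid URLs
        print("Could not open %s: %s" % (url, err))
        return None

# A string without a scheme is rejected before any network access happens
result = try_open("not-a-valid-url")
```

Returning None keeps the calling code simple, at the cost of hiding the reason for the failure; re-raising the exception may suit larger programs better.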


The returned object (named f above) is a file-like object; that is, it behaves like a normal Python file object. To work with a library like html5lib, pass this object to the library's parse function (see PythonHtml5lib).
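
Because the object is file-like, any code that expects a readable file can consume it. As a small sketch of that idea, the standard library's shutil.copyfileobj can copy a page into another file object. The helper name save_page is hypothetical, and an in-memory io.BytesIO object stands in here for the page returned by openURL so that the example runs without a network connection.

```python
import io
import shutil

def save_page(pagefile, out):
    """Copy everything from a readable file-like object into a writable one."""
    shutil.copyfileobj(pagefile, out)

# Stand-in for the file-like object returned by openURL
source = io.BytesIO(b"<html><title>Example</title></html>")
target = io.BytesIO()
save_page(source, target)
saved = target.getvalue()
```

With a real page, target would typically be a local file opened in binary mode, for example open("page.html", "wb").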

To read the contents of the page as a string, call the file object's read function.

pagecontents = f.read()
print(pagecontents)
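
Under Python 3, read() returns bytes rather than a string, so the contents usually need decoding before they can be treated as text. The following is a minimal sketch of one way to do that, not part of the recipe above: the helper name read_text and the UTF-8 fallback are assumptions, and an in-memory io.BytesIO object again stands in for a real response.

```python
import io

def read_text(pagefile, fallback_encoding="utf-8"):
    """Read a file-like response and return its contents as text."""
    raw = pagefile.read()
    if not isinstance(raw, bytes):
        # Already text, nothing to decode
        return raw
    charset = None
    headers = getattr(pagefile, "headers", None)
    if headers is not None and hasattr(headers, "get_content_charset"):
        # Real HTTP responses may declare their charset in Content-Type
        charset = headers.get_content_charset()
    # Fall back to the assumed default when no charset is declared
    return raw.decode(charset or fallback_encoding, errors="replace")

# Stand-in for a network response; it has no headers, so the fallback is used
fake_page = io.BytesIO("<html><body>caf\u00e9</body></html>".encode("utf-8"))
text = read_text(fake_page)
```

Using errors="replace" avoids an exception on bytes that do not match the chosen encoding, at the cost of silently substituting a placeholder character.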