From Media Design: Networked & Lens-Based wiki

The following function shows how to use Python to load the source of a web page.

import urllib2

def openURL (url, user_agent="Mozilla/5.0 (X11; U; Linux x86_64; fr; rv: Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"):
    """
    Returns: tuple with (file, actualurl)
    sets user_agent & follows redirection if necessary
    realurl may be different than url in the case of a redirect
    """
    request = urllib2.Request(url)
    if user_agent:
        request.add_header("User-Agent", user_agent)
    # open the request; urlopen follows HTTP redirects automatically
    pagefile = urllib2.urlopen(request)
    # geturl() gives the final URL after any redirects
    realurl = pagefile.geturl()
    return (pagefile, realurl)
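
The urllib2 module exists only in Python 2; in Python 3 the same calls live in urllib.request. As a rough sketch of how the function above might look there (the name open_url and the shortened default user agent string are assumptions for illustration, not part of the original page):

```python
import urllib.request

# Placeholder user agent for illustration only (an assumption, not the original string)
DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"

def open_url(url, user_agent=DEFAULT_USER_AGENT):
    """Return (response, real_url); real_url can differ from url after a redirect."""
    request = urllib.request.Request(url)
    if user_agent:
        request.add_header("User-Agent", user_agent)
    # urlopen follows HTTP redirects by default
    response = urllib.request.urlopen(request)
    return response, response.geturl()

# A data: URL is used here only so the sketch runs without network access
f, real_url = open_url("data:text/plain,hello")
print(f.read())
print(real_url)
```

With a real http:// address the call works the same way. Note that in Python 3 read() returns bytes rather than a text string.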

For example, to use the function to load the CNN homepage:

(f, url) = openURL("")

The returned object (named f above) is a file-like object; that is, it behaves like a normal Python file object. To work with a library like html5lib, pass this object to the library's parse function (see PythonHtml5lib).

To read the contents of the page as a string, call the file object's read function.

pagecontents = f.read()
print pagecontents
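
In Python 3, read() returns bytes, so the contents usually need decoding before they can be treated as text. A minimal sketch, assuming Python 3 and urllib.request; the helper name read_text and the UTF-8 fallback are assumptions for illustration:

```python
import urllib.request

def read_text(response, fallback_encoding="utf-8"):
    """Read the whole response and decode it to a text string."""
    raw = response.read()  # bytes in Python 3
    # Use the charset declared in the Content-Type header, if there is one
    charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")

# A data: URL is used here only so the sketch runs without network access
response = urllib.request.urlopen("data:text/plain;charset=utf-8,caf%C3%A9")
page_contents = read_text(response)
print(page_contents)
```

The errors="replace" argument is a design choice: pages sometimes declare one encoding and send another, and replacing undecodable bytes avoids an exception at the cost of a few substituted characters.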