BeautifulSoup

From Media Design: Networked & Lens-Based wiki
Jump to navigation Jump to search

Beautiful Soup

Beautiful Soup is a poem by Lewis Carroll in the novel Alice in Wonderland. It's also a Python library for reading and manipulating HTML pages.

Unlike the various modules built in to Python for reading text and XML, BeautifulSoup is able to deal gracefully with many of the complexities of web pages "in the wild". So for instance, rather than totally freaking out when a tag is incorrectly closed (like an XML parser would), Soup does the best it can and returns something approximating what a browser like FireFox would display to a user. For this reason Soup is an invaluable tool for doing spidering / text analysis / extraction / remixing of web pages over which you have less than total control.

Cookbook (Python code examples)

Getting a Soup from a URL

A useful all-purpose way of reading the contents of a webpage.

It does two important things, beyond a simple URL open:

  1. Sets the "user_agent"
  2. Handles redirection

Uses urllib2.Request to set the "user_agent" string -- in this way the request will appear to be coming from a browser (A Linux version of Firefox in the example given below... but this could be changed to whatever). Some sites attempt to block "bots" by rejecting requests if the user-agent (browser) is not recognizied.

The urllib2.urlopen function deals with eventual redirection to a different page location, which is why the function returns both the soup and the "actual" URL. This "realurl" should be used in any subsequent substitution / absolutizing of URLs inside the page as this is where the page actually is.

def opensoup (url):
	"""
	returns (page, actualurl)
	sets user_agent and resolves possible redirection
	realurl maybe different than url in the case of a redirect
	"""	
	request = urllib2.Request(url)
	user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
	request.add_header("User-Agent", user_agent)
	pagefile=urllib2.urlopen(request)
	soup=BeautifulSoup.BeautifulSoup(pagefile)
	realurl = pagefile.geturl()
	pagefile.close()
	return (soup, realurl)


Example Usage

import urllib2, BeautifulSoup

def opensoup (url):
	"""
	returns (page, actualurl)
	sets user_agent and resolves possible redirection
	realurl maybe different than url in the case of a redirect
	"""	
	request = urllib2.Request(url)
	user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
	request.add_header("User-Agent", user_agent)
	pagefile=urllib2.urlopen(request)
	soup=BeautifulSoup.BeautifulSoup(pagefile)
	realurl = pagefile.geturl()
	pagefile.close()
	return (soup, realurl)


(soup, url) = opensoup("http://pzwart.wdka.hro.nl")
# print soup.prettify()
for link in soup.findAll("a"):
	print link

Replace the contents of a tag

A function to replace the contents of a tag:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")

def setcontents (tag, val):
	# remove previous contents
	for c in tag.contents:
		c.extract()
	# insert the new
	tag.insert(0, val)

items = soup.findAll("li")
for item in items:
	setcontents(item, "foo")

print soup.prettify()

Wrap one (existing) tag inside of another (newly created) tag

import BeautifulSoup

def wraptag (tag, wrapper):
	# <tag>contents</tag> ==> <wrapper><tag>contents</tag></wrapper>
	tagIndex = tag.parent.contents.index(tag)
	tag.parent.insert(tagIndex, wrapper)
	wrapper.insert(0, tag)

# TEST CODE
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")
items = soup.findAll("li")
for item in items:	
	div = BeautifulSoup.Tag(soup, "div")
	wraptag(item, div)
print soup.prettify()

Code Questions

When "absolutizing", how to patch url's in stylesheets.