BeautifulSoup
Beautiful Soup is a Python library for reading and manipulating HTML pages.
Unlike the various modules built in to Python for reading text and XML, BeautifulSoup is able to deal gracefully with many of the complexities of web pages "in the wild". So for instance, rather than totally freaking out when a tag is incorrectly closed (like an XML parser would), Soup does the best it can and returns something approximating what a browser like FireFox would display to a user. For this reason Soup is an invaluable tool for doing spidering / text analysis / extraction / remixing of web pages over which you have less than total control.
Cookbook (Python code examples)
Getting a Soup from a URL
A useful all-purpose way of reading the contents of a webpage.
It does two important things, beyond a simple URL open:
- Sets the "user_agent"
- Handles redirection
Uses urllib2.Request to set the "user_agent" string -- in this way the request will appear to be coming from a browser (A Linux version of Firefox in the example given below... but this could be changed to whatever). Some sites attempt to block "bots" by rejecting requests if the user-agent (browser) is not recognizied.
The urllib2.urlopen function deals with eventual redirection to a different page location, which is why the function returns both the soup and the "actual" URL. This "realurl" should be used in any subsequent substitution / absolutizing of URLs inside the page as this is where the page actually is.
def opensoup (url):
"""
returns (page, actualurl)
sets user_agent and resolves possible redirection
realurl maybe different than url in the case of a redirect
"""
request = urllib2.Request(url)
user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
request.add_header("User-Agent", user_agent)
pagefile=urllib2.urlopen(request)
soup=BeautifulSoup.BeautifulSoup(pagefile)
realurl = pagefile.geturl()
pagefile.close()
return (soup, realurl)
Replace the contents of a tag
A function to replace the contents of a tag:
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")
def setcontents (tag, val):
# remove previous contents
for c in tag.contents:
c.extract()
# insert the new
tag.insert(0, val)
items = soup.findAll("li")
for item in items:
setcontents(item, "foo")
print soup.prettify()
Wrap one (existing) tag inside of another (newly created) tag
import BeautifulSoup
def wraptag (tag, wrapper):
# <tag>contents</tag> ==> <wrapper><tag>contents</tag></wrapper>
tagIndex = tag.parent.contents.index(tag)
tag.parent.insert(tagIndex, wrapper)
wrapper.insert(0, tag)
# TEST CODE
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")
items = soup.findAll("li")
for item in items:
div = BeautifulSoup.Tag(soup, "div")
wraptag(item, div)
print soup.prettify()
Code Questions
When "absolutizing", how to patch url's in stylesheets.