BeautifulSoup: Difference between revisions

From XPUB & Lens-Based wiki
Line 5: Line 5:


== Code Examples ==
== Code Examples ==
=== opensoup ===


A useful all-purpose way of reading the contents of a webpage (nb: it makes use of the urllib2 module):
A useful all-purpose way of reading the contents of a webpage (nb: it makes use of the urllib2 module):
Line 26: Line 28:
</source>
</source>


=== replace the contents of a tag ===
A function to replace the contents of a tag:
A function to replace the contents of a tag:
<source lang="python">
<source lang="python">
Line 45: Line 50:
</source>
</source>


A function to wrap one tag inside of another one.
 
=== wrap one tag inside of another ===


<source lang="python">
<source lang="python">
Line 56: Line 62:
wrapper.insert(0, tag)
wrapper.insert(0, tag)


# TEST CODE
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")
items = soup.findAll("li")
items = soup.findAll("li")
Line 61: Line 68:
div = BeautifulSoup.Tag(soup, "div")
div = BeautifulSoup.Tag(soup, "div")
wraptag(item, div)
wraptag(item, div)
print soup.prettify()
print soup.prettify()
</source>
</source>



Revision as of 13:52, 23 August 2008

Beautiful Soup is a Python library for manipulating HTML pages.

Code Examples

opensoup

A useful all-purpose way of reading the contents of a webpage (nb: it makes use of the urllib2 module): NB: the urllib2 module is used to connect to a page -- it deals with eventual redirection to a different page location, which is why the function returns both the soup and the "actual" URL. This "realurl" should be used in any subsequent substitution / absolutizing of URLs inside the page as this is where the page actually is.

def opensoup (url):
	"""
	returns (page, actualurl)
	sets user_agent and resolves possible redirection
	realurl maybe different than url in the case of a redirect
	"""	
	request = urllib2.Request(url)
	user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
	request.add_header("User-Agent", user_agent)
	pagefile=urllib2.urlopen(request)
	soup=BeautifulSoup.BeautifulSoup(pagefile)
	realurl = pagefile.geturl()
	pagefile.close()
	return (soup, realurl)


replace the contents of a tag

A function to replace the contents of a tag:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")

def setcontents (tag, val):
	# remove previous contents
	for c in tag.contents:
		c.extract()
	# insert the new
	tag.insert(0, val)

items = soup.findAll("li")
for item in items:
	setcontents(item, "foo")

print soup.prettify()


wrap one tag inside of another

import BeautifulSoup

def wraptag (tag, wrapper):
	# wraps tag with wrapper
	tagIndex = tag.parent.contents.index(tag)
	tag.parent.insert(tagIndex, wrapper)
	wrapper.insert(0, tag)

# TEST CODE
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")
items = soup.findAll("li")
for item in items:	
	div = BeautifulSoup.Tag(soup, "div")
	wraptag(item, div)
print soup.prettify()

Code Questions

  • absolutize function needs ability to patch url's in stylesheets.