Beautiful Soup

Beautiful Soup is a poem by Lewis Carroll in the novel Alice in Wonderland. It's also a Python library for reading and manipulating HTML pages.

  * BeautifulSoup website: http://www.crummy.com/software/BeautifulSoup/
  * Documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html

Unlike the various modules built into Python for reading text and XML, BeautifulSoup is able to deal gracefully with many of the complexities of web pages "in the wild". So for instance, rather than totally freaking out when a tag is incorrectly closed (as a strict XML parser would), Soup does the best it can and returns something approximating what a browser like Firefox would display to a user. For this reason Soup is an invaluable tool for spidering / text analysis / extraction / remixing of web pages over which you have less than total control.
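
For example (a minimal sketch -- the broken snippet below is made up for illustration), a fragment with unclosed tags still yields a usable, repaired parse tree:

import BeautifulSoup

# the <li> tags (and the <ul>) are never closed -- a strict XML parser would reject this
broken = "<ul><li>one<li>two"
soup = BeautifulSoup.BeautifulSoup(broken)

# Soup closes the tags itself, roughly as a browser would
print soup.prettify()
for item in soup.findAll("li"):
	print item.string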

Cookbook (Python code examples)

Getting a Soup from a URL

A useful all-purpose way of reading the contents of a webpage.

It does two important things, beyond a simple URL open:

  1. Sets the "user_agent"
  2. Handles redirection

Uses urllib2.Request to set the "user_agent" string -- in this way the request will appear to be coming from a browser (a Linux version of Firefox in the example given below, but this could be changed to any other browser string). Some sites attempt to block "bots" by rejecting requests if the user-agent (browser) is not recognized.

The urllib2.urlopen function follows any redirection to a different page location, which is why the function returns both the soup and the "actual" URL. This "realurl" should be used in any subsequent substitution / absolutizing of URLs inside the page, since that is where the page actually lives.

import urllib2, BeautifulSoup

def opensoup (url):
	"""
	returns (soup, realurl)
	sets user_agent and resolves possible redirection
	realurl may be different from url in the case of a redirect
	"""
	request = urllib2.Request(url)
	# pretend to be a browser, so that sites which reject unknown "bots" still answer
	user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
	request.add_header("User-Agent", user_agent)
	pagefile = urllib2.urlopen(request)
	soup = BeautifulSoup.BeautifulSoup(pagefile)
	# geturl() gives the final location after any redirects
	realurl = pagefile.geturl()
	pagefile.close()
	return (soup, realurl)


Example Usage

# using the opensoup function defined above
(soup, url) = opensoup("http://pzwart.wdka.hro.nl")
# print soup.prettify()
for link in soup.findAll("a"):
	print link
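
A natural next step is to make the links absolute using the "realurl" that opensoup returns -- a small sketch, assuming the opensoup function defined above; urlparse.urljoin does the actual resolving, BeautifulSoup only reads and writes the href attribute:

import urlparse

(soup, realurl) = opensoup("http://pzwart.wdka.hro.nl")
for link in soup.findAll("a", href=True):
	# resolve relative hrefs against the page's actual location
	link["href"] = urlparse.urljoin(realurl, link["href"])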

Replace the contents of a tag

A function to replace the contents of a tag:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")

def setcontents (tag, val):
	# remove previous contents
	# (iterate over a copy of the list, since extract() removes items from tag.contents)
	for c in tag.contents[:]:
		c.extract()
	# insert the new contents
	tag.insert(0, val)

items = soup.findAll("li")
for item in items:
	setcontents(item, "foo")

print soup.prettify()

Wrap one (existing) tag inside of another (newly created) tag

import BeautifulSoup

def wraptag (tag, wrapper):
	# <tag>contents</tag> ==> <wrapper><tag>contents</tag></wrapper>
	# insert the wrapper at the tag's current position, so document order is preserved
	tagIndex = tag.parent.contents.index(tag)
	tag.parent.insert(tagIndex, wrapper)
	# insert() moves the tag out of its old position and into the wrapper
	wrapper.insert(0, tag)

# TEST CODE
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")
items = soup.findAll("li")
for item in items:	
	div = BeautifulSoup.Tag(soup, "div")
	wraptag(item, div)
print soup.prettify()

Code Questions

When "absolutizing", how to patch url's in stylesheets.