= Beautiful Soup =
Beautiful Soup is a poem by Lewis Carroll in the novel ''Alice's Adventures in Wonderland''. It's also a Python library for reading and manipulating HTML pages.
Unlike the various modules built into Python for reading text and XML, BeautifulSoup is able to deal gracefully with many of the complexities of web pages "in the wild". So for instance, rather than totally freaking out when a tag is incorrectly closed (like an XML parser would), Soup does the best it can and returns something approximating what a browser like Firefox would display to a user. For this reason Soup is an invaluable tool for doing spidering / text analysis / extraction / remixing of web pages over which you have less than total control.
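As a quick illustration (a minimal sketch, assuming the same BeautifulSoup 3 module used in the recipes below): the snippet deliberately leaves tags unclosed, and Soup still builds a usable tree where a strict XML parser would raise an error.
<source lang="python">
import BeautifulSoup

# Deliberately sloppy markup: neither the first <p> nor the <b> is ever closed.
broken = "<p>one<b>two<p>three"
soup = BeautifulSoup.BeautifulSoup(broken)

# Soup closes the tags roughly as a browser would and returns a tree we can walk.
print soup.prettify()
</source>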
* [http://www.crummy.com/software/BeautifulSoup/ BeautifulSoup website]
* [http://www.crummy.com/software/BeautifulSoup/documentation.html Documentation]
== Cookbook (Python code examples) ==

=== Getting a Soup from a URL ===

A useful all-purpose way of reading the contents of a webpage.

It does two important things, beyond a simple URL open:

# Sets the "user_agent"
# Handles redirection
Uses '''urllib2.Request''' to set the "user_agent" string -- in this way the request will appear to be coming from a browser (a Linux version of Firefox in the example given below, but this could be changed to whatever). Some sites attempt to block "bots" by rejecting requests if the user-agent (browser) is not recognized.

The '''urllib2.urlopen''' function handles any redirection to a different page location, which is why the function returns both the soup and the "actual" URL. This "realurl" should be used in any subsequent substitution / absolutizing of URLs inside the page, as this is where the page actually is.
<source lang="python">
import urllib2, BeautifulSoup

def opensoup(url):
    """
    returns (soup, realurl)
    sets the User-Agent header and resolves possible redirection
    realurl may be different than url in the case of a redirect
    """
    request = urllib2.Request(url)
    user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
    request.add_header("User-Agent", user_agent)
    pagefile = urllib2.urlopen(request)
    soup = BeautifulSoup.BeautifulSoup(pagefile)
    realurl = pagefile.geturl()
    pagefile.close()
    return (soup, realurl)
</source>
==== Example Usage ====
<source lang="python">
import urllib2, BeautifulSoup

def opensoup(url):
    """
    returns (soup, realurl)
    sets the User-Agent header and resolves possible redirection
    realurl may be different than url in the case of a redirect
    """
    request = urllib2.Request(url)
    user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
    request.add_header("User-Agent", user_agent)
    pagefile = urllib2.urlopen(request)
    soup = BeautifulSoup.BeautifulSoup(pagefile)
    realurl = pagefile.geturl()
    pagefile.close()
    return (soup, realurl)

(soup, url) = opensoup("http://pzwart.wdka.hro.nl")
# print soup.prettify()
for link in soup.findAll("a"):
    print link
</source>
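Building on the usage above, here is a sketch (not part of the original recipe) of one way the returned "realurl" might be used to absolutize the links in a page. It assumes the opensoup function defined above and the standard urlparse module.
<source lang="python">
import urlparse

# Sketch: make every href in the page absolute, using the "realurl"
# returned by opensoup() as the base for any relative links.
(soup, realurl) = opensoup("http://pzwart.wdka.hro.nl")
for link in soup.findAll("a", href=True):
    link["href"] = urlparse.urljoin(realurl, link["href"])
</source>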
=== Replace the contents of a tag ===
A function to replace the contents of a tag:
<source lang="python">
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")

def setcontents(tag, val):
    # remove previous contents (iterate over a copy, since extract() mutates the list)
    for c in tag.contents[:]:
        c.extract()
    # insert the new
    tag.insert(0, val)

items = soup.findAll("li")
for item in items:
    setcontents(item, "foo")

print soup.prettify()
</source>
=== Wrap one (existing) tag inside of another (newly created) tag ===
<source lang="python">
import BeautifulSoup

def wraptag(tag, wrapper):
    # <tag>contents</tag> ==> <wrapper><tag>contents</tag></wrapper>
    tagIndex = tag.parent.contents.index(tag)
    tag.parent.insert(tagIndex, wrapper)
    wrapper.insert(0, tag)

# TEST CODE
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")
items = soup.findAll("li")
for item in items:
    div = BeautifulSoup.Tag(soup, "div")
    wraptag(item, div)
print soup.prettify()
</source>
== Code Questions ==
When "absolutizing", how to patch URLs in stylesheets?
[[Category:Cookbook]]