BeautifulSoup: Difference between revisions
Line 5: | Line 5: | ||
== Code Examples == | == Code Examples == | ||
A useful all-purpose way of reading the contents of a webpage (nb: it makes use of the urllib2 module): | |||
NB: the urllib2 module is used to connect to a page -- it deals with eventual redirection to a different page location, which is why the function returns both the soup and the "actual" URL. This "realurl" should be used in any subsequent substitution / absolutizing of URLs inside the page as this is where the page actually is. | |||
<source lang="python"> | |||
def opensoup (url): | |||
""" | |||
returns (page, actualurl) | |||
sets user_agent and resolves possible redirection | |||
realurl maybe different than url in the case of a redirect | |||
""" | |||
request = urllib2.Request(url) | |||
user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14" | |||
request.add_header("User-Agent", user_agent) | |||
pagefile=urllib2.urlopen(request) | |||
soup=BeautifulSoup.BeautifulSoup(pagefile) | |||
realurl = pagefile.geturl() | |||
pagefile.close() | |||
return (soup, realurl) | |||
</source> | |||
A function to replace the contents of a tag: | A function to replace the contents of a tag: |
Revision as of 13:50, 23 August 2008
Beautiful Soup is a Python library for manipulating HTML pages.
Code Examples
A useful all-purpose way of reading the contents of a webpage (nb: it makes use of the urllib2 module): NB: the urllib2 module is used to connect to a page -- it deals with eventual redirection to a different page location, which is why the function returns both the soup and the "actual" URL. This "realurl" should be used in any subsequent substitution / absolutizing of URLs inside the page as this is where the page actually is.
def opensoup (url):
"""
returns (page, actualurl)
sets user_agent and resolves possible redirection
realurl maybe different than url in the case of a redirect
"""
request = urllib2.Request(url)
user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
request.add_header("User-Agent", user_agent)
pagefile=urllib2.urlopen(request)
soup=BeautifulSoup.BeautifulSoup(pagefile)
realurl = pagefile.geturl()
pagefile.close()
return (soup, realurl)
A function to replace the contents of a tag:
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")
def setcontents (tag, val):
# remove previous contents
for c in tag.contents:
c.extract()
# insert the new
tag.insert(0, val)
items = soup.findAll("li")
for item in items:
setcontents(item, "foo")
print soup.prettify()
A function to wrap one tag inside of another one.
import BeautifulSoup
def wraptag (tag, wrapper):
# wraps tag with wrapper
tagIndex = tag.parent.contents.index(tag)
tag.parent.insert(tagIndex, wrapper)
wrapper.insert(0, tag)
soup = BeautifulSoup.BeautifulSoup("<ul><li>one</li><li>two</li></ul>")
items = soup.findAll("li")
for item in items:
div = BeautifulSoup.Tag(soup, "div")
wraptag(item, div)
print soup.prettify()
Code Questions
- absolutize function needs ability to patch url's in stylesheets.