2008 3.07: Difference between revisions

From XPUB & Lens-Based wiki

Revision as of 13:56, 29 May 2008

Just Browsing?

For many people, the web browser is quite possibly the most used piece of software. As a result of its very persistence and ubiquity, it tends to fade into the background, becoming a "natural" and "neutral" part of one's daily (computing) experience.

The original conception of the world wide web was one that supported a variety of means of viewing and interacting with online content. By digging into the underlying network mechanisms, protocols, and markup languages it's possible to create radically different kinds of "browsing" of the material made available via the world wide web.

Some examples:

Issue with urllib and Wikipedia (setting the User-Agent header to "pretend" to be a "real" browser):
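Wikipedia rejects requests carrying urllib's default User-Agent, so the header has to be replaced. The page's code targets Python 2 (urllib2); a minimal sketch of the same trick under Python 3, where the equivalent lives in urllib.request (the URL and User-Agent string here are just illustrative):

```python
from urllib.request import Request

# hypothetical target page; Wikipedia blocks urllib's default User-Agent
url = "https://en.wikipedia.org/wiki/Web_browser"

# attach a browser-like User-Agent so the server treats us as a "real" browser
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
# urlopen(req) would now fetch the page with the spoofed header
```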

Page mashups with Python & Beautiful Soup
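The script below uses the old BeautifulSoup 3 API (`page.findAll("a")`) to pull links out of a page. The same idea can be sketched with nothing but the standard library's html.parser, shown here in Python 3 as a rough stand-in, not a drop-in replacement:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, like findAll("a") in the script below."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if "href" in attrs:
                self.links.append(attrs["href"])

collector = LinkCollector()
collector.feed('<p><a href="/news">News</a> <a name="anchor">no href</a></p>')
print(collector.links)  # ['/news']
```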

Some useful tools built into Python: urllib, urllib2, urlparse, cgi, and cgitb (all used in the code below).
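The script leans on two of these in particular: urlparse.urljoin to make relative links absolute, and urllib.quote to pack a whole URL into a query parameter. In Python 3 both have moved to urllib.parse; a quick sketch:

```python
from urllib.parse import urljoin, quote, unquote

# resolve a relative link against the page it appeared on
absolute = urljoin("http://news.bbc.co.uk/sport/", "../weather")
print(absolute)  # http://news.bbc.co.uk/weather

# percent-encode an entire URL; safe="" encodes even ":" and "/"
packed = quote("http://news.bbc.co.uk", safe="")
print(packed)    # http%3A%2F%2Fnews.bbc.co.uk
assert unquote(packed) == "http://news.bbc.co.uk"
```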


Code

#!/usr/bin/python

import BeautifulSoup, cgi
import urllib, urllib2, urlparse
import cgitb; cgitb.enable()

# read the "url" query parameter, falling back to the BBC news front page
inputs = cgi.FieldStorage()
pageurl = inputs.getvalue("url", "http://news.bbc.co.uk")
pagefile = urllib2.urlopen(pageurl)
page = BeautifulSoup.BeautifulSoup(pagefile)
realurl = pagefile.geturl()  # the final URL, after any redirects

print "Content-type: text/html"
print

# make all href attributes absolute
for r in page.findAll(True, {'href': True}):
	href = r['href']
	if not href.lower().startswith("http"):
		r['href'] = urlparse.urljoin(realurl, href)
# make all src attributes absolute
for r in page.findAll(True, {'src': True}):
	src = r['src']
	if not src.lower().startswith("http"):
		r['src'] = urlparse.urljoin(realurl, src)

title = ""
try:
	title = page.title.string
except AttributeError:
	pass

print "<h1>%s</h1>" % title
print "<h2>%s</h2>" % realurl
print "<ol>"
links = page.findAll("a")
for l in links:
	if not l.has_key("href"): continue
	href = l['href']
	if not href.lower().startswith("http"):
		href = urlparse.urljoin(realurl, href)
	label = l.renderContents()
	# route the link back through this script, so browsing stays "inside" it
	href = "?url=" + urllib.quote(href, "")

	print """<li><a href="%s">%s</a></li>""" % (href, label)

print "</ol>"
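Each rewritten link routes back through the script itself: the target URL is quoted into a `?url=` query parameter, and on the next request the query-string parsing behind cgi.FieldStorage unquotes it again. The round trip can be sketched with the Python 3 names (urllib.parse instead of urllib):

```python
from urllib.parse import quote, unquote

href = "http://news.bbc.co.uk/sport"
wrapped = "?url=" + quote(href, safe="")      # what the script emits
recovered = unquote(wrapped[len("?url="):])   # what the next request sees
assert recovered == href
print(recovered)
```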