2008 3.07: Difference between revisions
(→Code) |
No edit summary |
||
Line 1: | Line 1: | ||
= | = Radical Browsers = | ||
Walk into any of your more "service oriented" stores in the United States and you're likely to be quickly "greeted" by a salesperson along the lines of "is there something we can help you with today?" A simple, and oft-used, non-commital response is to say, "No, thanks. I'm just browsing". | |||
The original conception of the world wide web was one that supported a variety of means of viewing and interacting with online content. By digging into the underlying network mechanisms, protocols, and markup languages it's possible to create radically different kinds of "browsing" of the material made available via the world wide web. | A software "browser" is for many quite possibly their most frequently used piece of software. As a result of it's very persistence and ubiquity, the browser as a piece of software fades into the background, becoming a "natural" and "neutral" part of one's daily (computing) experience. | ||
The original conception of the world wide web was one that supported a variety of means of viewing and interacting with online content. | |||
By digging into the underlying network mechanisms, protocols, and markup languages it's possible to create radically different kinds of "browsing" of the material made available via the world wide web. | |||
some examples | some examples |
Revision as of 00:58, 31 May 2008
Radical Browsers
Walk into any of your more "service oriented" stores in the United States and you're likely to be quickly "greeted" by a salesperson along the lines of "is there something we can help you with today?" A simple, and oft-used, non-commital response is to say, "No, thanks. I'm just browsing".
A software "browser" is for many quite possibly their most frequently used piece of software. As a result of it's very persistence and ubiquity, the browser as a piece of software fades into the background, becoming a "natural" and "neutral" part of one's daily (computing) experience.
The original conception of the world wide web was one that supported a variety of means of viewing and interacting with online content.
By digging into the underlying network mechanisms, protocols, and markup languages it's possible to create radically different kinds of "browsing" of the material made available via the world wide web.
some examples
Page mashups with Python & Beautiful Soup
Some useful tools built into Python:
- urllib2
- urlparse
Issue with urllib and wikipedia (Setting User-Agent to "pretend" to be a "real" browser):
Code
#!/usr/bin/python
import BeautifulSoup, cgi
import urllib, urllib2, urlparse
import cgitb; cgitb.enable()
inputs = cgi.FieldStorage()
pageurl = inputs.getvalue("url", "http://news.bbc.co.uk")
# setting the user-agent
request = urllib2.Request(pageurl)
user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14"
request.add_header("User-Agent", user_agent)
pagefile=urllib2.urlopen(request)
page=BeautifulSoup.BeautifulSoup(pagefile)
realurl = pagefile.geturl()
def scriptURL():
""" returns: current URL without query """
httpHost = os.environ.get('HTTP_HOST', os.environ.get('SERVER_NAME'))
scriptName = os.environ.get('SCRIPT_NAME')
return "http://" + httpHost + scriptName
this = scriptURL()
print "Content-type: text/html"
print
# make absolute all href's
for r in page.findAll(True, {'href': True}):
href = r['href']
if not href.lower().startswith("http"):
r['href'] = urlparse.urljoin(realurl, href)
# make absolute all src's
for r in page.findAll(True, {'src': True}):
href = r['src']
if not href.lower().startswith("http"):
r['src'] = urlparse.urljoin(realurl, href)
title = ""
try:
title = page.title.string
except AttributeError:
pass
print "<h1>%s</h1>" % title
print "<h2>%s</h2>" % (realurl)
print "<ol>"
links=page.findAll("a")
for l in links:
if not l.has_key("href"): continue
href = l['href']
if not href.lower().startswith("http"):
href = urlparse.urljoin(realurl, href)
label = l.renderContents()
href = this + "?url=" + urllib.quote(href, "")
print """<li><a href="%s">%s</a></li>""" % (href, label)
print "</ol>"