User:Eleanorg/2.1/first image grabber
Latest revision as of 04:41, 11 November 2012
Learning how to fetch and parse webpages with urllib and sgmllib in Python, so that I can create dynamic pages made up of content from elsewhere. This may be terribly 90s, but I am still utterly fascinated by the implications of this basic technology: a 'transparent' webpage that acts only as a portal, channeling others.
Where I'm going with it: the ambivalent status of an aggregator, which 'curates' content (gaining power over external content) while simultaneously being composed of that very content (surrendering power to external content).
View a simple example here (http://pzwart3.wdka.hro.nl/~egreenhalgh/cgi-bin/urllib/sgmlImageGrabber.cgi): all the images from my website's homepage, layered on top of each other.
code
#!/usr/bin/python
#-*- coding:utf-8 -*-
# based on a tutorial at http://www.boddie.org.uk/python/HTML.html
# using his code from here: http://www.boddie.org.uk/python/downloads/HTML1.py
import urllib, sgmllib
class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."
        sgmllib.SGMLParser.__init__(self, verbose)
        self.images = []

    def start_img(self, attributes):  # the method name matters: sgmllib calls start_<tagname> (eg, start_img or start_a) for each opening tag
        "Process an image tag and its 'attributes'."
        for name, value in attributes:
            if name == "src":  # this is the important line: it picks which attribute of the tag to look for
                self.images.append(value)  # if a 'src' attribute is found, append its value to the list 'images'

    def get_images(self):
        "Return the list of images."
        return self.images
url = "http://eleanorg.org"
site = urllib.urlopen(url)
html = site.read()
parser = MyParser()
parser.parse(html)
# print html
# print parser.get_images() # this prints the whole array of 'src' values found (ie, the image urls)
# if we want to display the images we've found, we can reproduce them using the values from our 'images' array:
print "Content-Type: text/html"
print
print """
<!DOCTYPE html>
<html>
<head>
<title></title>
<style type="text/css">
img { position: absolute; top:100px; left:100px; opacity: 0.3; width:500px; }
</style>
</head>
<body>"""
imageList = parser.get_images() # the list of 'src' values collected by the parser
for item in imageList:
    print "<img src='" + item + "' />" # no space after the opening quote, so the src value is exactly the url
print """
</body>
</html>"""
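A note for anyone trying this on Python 3: sgmllib was removed from the standard library there, and urllib.urlopen now lives at urllib.request.urlopen. Below is a rough sketch of the same idea using html.parser.HTMLParser instead. It is not the script that runs the example above. The names ImageCollector, extract_images, render_page and fetch_html are made up for this sketch. It also does two things the original does not: it resolves relative 'src' values against the page url with urljoin (otherwise they would point at the wrong host once the images are shown on another site), and it escapes the values before writing them into the page.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ImageCollector(HTMLParser):
    """Collect the 'src' value of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Unlike sgmllib there is no start_img hook: every opening tag
        # arrives here, so we filter on the (lowercased) tag name.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                self.images.append(value)


def extract_images(html, base_url):
    """Return absolute image urls found in 'html' (hypothetical helper)."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    # Relative values such as "/img/a.png" are resolved against base_url.
    return [urljoin(base_url, src) for src in parser.images]


def render_page(image_urls):
    """Build the layered-images page as a string (hypothetical helper)."""
    tags = "\n".join(
        '<img src="%s" />' % escape(u, quote=True) for u in image_urls
    )
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n<title></title>\n"
        '<style type="text/css">\n'
        "img { position: absolute; top:100px; left:100px; "
        "opacity: 0.3; width:500px; }\n"
        "</style>\n</head>\n<body>\n" + tags + "\n</body>\n</html>"
    )


def fetch_html(url):
    """Download a page; assumes it decodes as UTF-8 (an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demo with a made-up snippet of HTML, so nothing is downloaded.
sample = '<p><img src="/a.png" alt="x"><img src="http://example.com/b.jpg" /></p>'
demo_urls = extract_images(sample, "http://eleanorg.org")
```

To use it as a CGI script like the one above, print "Content-Type: text/html", then a blank line, then render_page(extract_images(fetch_html(url), url)).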