User:Eleanorg/2.1/first image grabber
Learning how to parse webpages with urllib in Python, so that I can create dynamic pages made up of content from elsewhere. This may be terribly '90s, but I am still utterly fascinated by the implications of this basic technology: a 'transparent' webpage that acts only as a portal, channeling others.
Where I'm going with it: the ambivalent status of an aggregator, which 'curates' content (gaining power over external content) while simultaneously being composed of that very content (surrendering power to external content).
View a simple example here - all the images from my website's homepage, layered on top of each other.
code
#!/usr/bin/python
#-*- coding:utf-8 -*-
# based on a tutorial at http://www.boddie.org.uk/python/HTML.html
# using his code from here: http://www.boddie.org.uk/python/downloads/HTML1.py
# Python 2 only: sgmllib was removed in Python 3, and urllib.urlopen moved to urllib.request
import urllib, sgmllib
class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."
        sgmllib.SGMLParser.__init__(self, verbose)
        self.images = []

    # the name you give this function is important: it must be called
    # start_tagname (eg, start_img or start_a), because SGMLParser looks
    # the handler up by the name of the tag it has just met
    def start_img(self, attributes):
        "Process an image tag and its 'attributes'."
        for name, value in attributes:
            # this is the important line, where we specify which attribute
            # to look for in the tag ('src' for images)
            if name == "src":
                self.images.append(value)

    def get_images(self):
        "Return the list of images."
        return self.images
url = "http://eleanorg.org"
site = urllib.urlopen(url)
html = site.read()
parser = MyParser()
parser.parse(html)
# print html
# print parser.get_images() # this prints the whole array of 'src' values found (ie, the image urls)
# if we want to display the images we've found, we can reproduce them using the values from our 'images' array:
print "Content-Type: text/html"
print
print """
<!DOCTYPE html>
<html>
<head>
<title></title>
<style type="text/css">
img { position: absolute; top:100px; left:100px; opacity: 0.3; width:500px; }
</style>
</head>
<body>"""
imageList = parser.get_images() # the list of 'src' values collected by the parser
for item in imageList:
    print "<img src='" + item + "' />"
print """
</body>
</html>"""
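The script above is Python 2 (sgmllib no longer exists in Python 3). As a rough sketch only, this is how the same idea might look in Python 3, using html.parser from the standard library instead. The names ImageParser, fetch_html, find_images and render_page are made up for this example, and the UTF-8 decoding is an assumption about the page being fetched. It also resolves relative image paths against the page's URL with urljoin, which the script above does not do, so images written as "/pic.png" should still load when the page is served from somewhere else.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ImageParser(HTMLParser):
    """Collect the 'src' value of every <img> tag (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this,
        # and self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.images.append(value)


def fetch_html(url):
    """Download a page and decode it (assumes the page is UTF-8)."""
    with urlopen(url) as site:
        return site.read().decode("utf-8", errors="replace")


def find_images(html, base_url):
    """Return absolute URLs for every image found in the html string."""
    parser = ImageParser()
    parser.feed(html)
    parser.close()
    # urljoin leaves absolute URLs alone and resolves relative ones.
    return [urljoin(base_url, src.strip()) for src in parser.images]


def render_page(image_urls):
    """Build a bare HTML page containing one <img> tag per URL."""
    tags = "\n".join(
        '<img src="%s" />' % escape(url, quote=True) for url in image_urls
    )
    return "<!DOCTYPE html>\n<html>\n<body>\n%s\n</body>\n</html>" % tags


if __name__ == "__main__":
    # An inline sample so the sketch runs without touching the network.
    sample = '<p><img src="/a.png"><IMG SRC="http://example.com/b.jpg" /></p>'
    print(render_page(find_images(sample, "http://eleanorg.org")))
```

To try it on a live page, swap the inline sample for fetch_html("http://eleanorg.org"), and add the same style block as above if you want the images layered on top of each other.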