User:Eleanorg/2.1/first image grabber

From XPUB & Lens-Based wiki

Learning how to fetch webpages with urllib and parse them with sgmllib in Python, so that I can create dynamic pages made up of content from elsewhere. This may be terribly 90s, but I am still utterly fascinated by the implications of this basic technology: a 'transparent' webpage that acts only as a portal, channeling others.

Where I'm going with it: the ambivalent status of an aggregator, which 'curates' content (gaining power over external content) while simultaneously being composed of that very content (surrendering power to it).

View a simple example here - all the images from my website's homepage, layered on top of each other.

code

#!/usr/bin/python
#-*- coding:utf-8 -*-


# based on a tutorial at http://www.boddie.org.uk/python/HTML.html
# using his code from here: http://www.boddie.org.uk/python/downloads/HTML1.py 


import urllib, sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.images = []

    def start_img(self, attributes):            # the method name matters: sgmllib calls start_<tagname> (eg, start_img or start_a) for each opening tag of that name
        "Process an image tag and its 'attributes'."

        for name, value in attributes:
            if name == "src":                   # this is the important line, where we specify which attribute of the tag to look for
                self.images.append(value)       # if a 'src' attribute is found, append its value (the image url) to the list 'images'

    def get_images(self):
        "Return the list of images."

        return self.images

url = "http://eleanorg.org"
site = urllib.urlopen(url)
html = site.read()

parser = MyParser()
parser.parse(html)

# print html
# print parser.get_images()                     # this prints the whole array of 'src' values found (ie, the image urls)

# if we want to display the images we've found, we can reproduce them using the values from our 'images' array:

print "Content-Type: text/html"
print
print """
<!DOCTYPE html>
<html>
  <head>
    <title></title>
    <style type="text/css">
        img { position: absolute; top:100px; left:100px; opacity: 0.3; width:500px; }
  </style>
 
  </head>
 
<body>"""

imageList = parser.get_images()                 # the list of 'src' values collected by the parser
for item in imageList:
    print "<img src='" + item + "' />"          # no space after the opening quote, or the url would start with a stray space


print """
</body>
</html>"""
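
The script above is Python 2: sgmllib was removed in Python 3, and urllib.urlopen moved to urllib.request.urlopen. A minimal sketch of the same idea for Python 3 might look like the following, assuming html.parser.HTMLParser as a stand-in for sgmllib. The names ImageParser and image_tags, the sample HTML string, and the urljoin step (which resolves relative 'src' paths against the source page so they still load when shown from another host) are illustrative assumptions, not part of the tutorial code.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageParser(HTMLParser):
    """Collect the 'src' value of every <img> tag fed to the parser."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url    # page the html came from (hypothetical parameter)
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative paths against the page they came from.
                self.images.append(urljoin(self.base_url, value))


def image_tags(html, base_url=""):
    """Return one <img> tag string per image found in 'html'."""
    parser = ImageParser(base_url)
    parser.feed(html)
    parser.close()
    return ['<img src="%s" />' % escape(src, quote=True) for src in parser.images]


# A fixed sample string is used here so the sketch runs without network access.
# For a live page, the html could come from:
#   urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
sample = '<p><img src="a.png" alt="x"><IMG SRC="http://example.com/b.jpg"></p>'
for tag in image_tags(sample, "http://eleanorg.org/"):
    print(tag)
```

The returned strings can be printed between the same header and footer blocks as in the original script. Escaping each url before writing it into the page is a precaution, since the values come from someone else's markup.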