Wikiwalker: Crawling wikipedia pages for images

From XPUB & Lens-Based wiki
Revision as of 17:01, 26 May 2014

Step 1: Extracting infobox images from a Wikipedia page

In this code, note the use of ElementTree's tostring function (https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.tostring) to convert a document element back into text. tostring takes an optional method argument with several useful values: html and xml both output markup, either loose (html) or strict (xml), the latter being useful if you want to feed the output into strict XML tools. Finally, the text method outputs text only, effectively stripping any HTML tags, which is useful when you want just the text.

from __future__ import print_function
import urllib2, html5lib
from urlparse import urljoin
from xml.etree import ElementTree as ET

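# Page where the crawl starts
start = "http://en.wikipedia.org/wiki/J._D._Salinger"

todo = [start]   # URLs still to visit
seen = set()     # URLs already visited

while len(todo) > 0:
    # Take the next URL off the front of the queue
    url, todo = todo[0], todo[1:]
    if url not in seen:
        # Remember this URL so it is never fetched twice
        seen.add(url)
        f = urllib2.urlopen(url)
        print("VISITING", url)
        src = f.read()
        # Parse without HTML namespaces so plain paths like ".//h1" match
        tree = html5lib.parse(src, namespaceHTMLElements=False)

        h1 = tree.find(".//h1")
        if h1 is not None:
            # print("title", ET.tostring(h1, method="text"))
            print("title", ET.tostring(h1, method="html"))

        # Print the absolute URL of every image inside an infobox table
        for table in tree.findall(".//table"):
            if "infobox" in table.get("class", "").split():
                for img in table.findall(".//img"):
                    img_src = img.get("src", "")
                    img_src = urljoin(url, img_src)
                    print(img_src)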
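
To make the three method values of tostring described above concrete, here is a minimal sketch that serializes one small element three ways. The h1 markup is a hand-written stand-in for illustration, not the real Wikipedia page source, and the .decode("ascii") calls are only there so the same lines behave alike under Python 2 and Python 3.

```python
from __future__ import print_function
from xml.etree import ElementTree as ET

# Hypothetical stand-in for a page title element (not fetched from Wikipedia)
h1 = ET.fromstring('<h1 id="firstHeading">J. D. <i>Salinger</i><br /></h1>')

# Strict XML: empty elements are written self-closed
as_xml = ET.tostring(h1, method="xml").decode("ascii")

# Loose HTML: void elements such as br are written without a closing slash
as_html = ET.tostring(h1, method="html").decode("ascii")

# Text only: all tags are dropped, only the character data remains
as_text = ET.tostring(h1, method="text").decode("ascii")

print("xml: ", as_xml)
print("html:", as_html)
print("text:", as_text)
```

The text variant is usually the one to reach for when printing a page title, while the xml variant is the safer choice when the output is handed on to another XML parser.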