Web scraping with Python

From XPUB & Lens-Based wiki
Revision as of 15:31, 26 May 2014 by Michael Murtaugh

Tools

* python
* html5lib
* ElementTree (https://docs.python.org/2/library/xml.etree.elementtree.html), part of the standard Python library

Scraping dmoz.org

Example 1: Pulling the URLs + textual descriptions from a single page

Consider a single page on dmoz, such as:

http://www.dmoz.org/Science/Astronomy/

If you look at the page source, you can see the structure around the URLs listed at the bottom of the page:

<ul style="margin-left:0;" class="directory-url">
<li>
<a class="listinglink" href="http://www.absoluteastronomy.com/">Absolute Astronomy</a> 
- Facts and statistical information about planets, moons, constellations, stars, galaxies, and Messier objects.
<div class="flag"><a href="/public/flag?cat=Science%2FAstronomy&amp;url=http%3A%2F%2Fwww.absoluteastronomy.com%2F"><img title="report an issue with this listing" alt="[!]" src="/img/flag.png"></a></div>
</li>
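
The following Python 2 script takes the page URL as its first command-line argument and prints the URL and description of each listing:

from __future__ import print_function
import sys
import urllib2
import html5lib

# The page to scrape is given as the first command-line argument
page_url = sys.argv[1]

f = urllib2.urlopen(page_url)
src = f.read()
# Parse the HTML into an ElementTree-style tree; with namespaces
# switched off, tags can be matched by their plain names ("ul", "li", "a")
tree = html5lib.parse(src, namespaceHTMLElements=False)

for ul in tree.findall(".//ul"):
    # class is a space-separated list, so split it before testing membership
    if "directory-url" in ul.get("class", "").split():
        for li in ul.findall("li"):
            for a in li.findall("a"):
                if "listinglink" in a.get("class", "").split():
                    url = a.get("href")
                    # The description is the text that follows </a> (the
                    # element's "tail"); it is None when no text follows
                    description = (a.tail or "").strip().strip("-").strip()
                    print(url)
                    print("\t" + description.encode("utf-8"))
                    print()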
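
The description is not inside the link element: it is the text that follows the closing </a> tag, which ElementTree exposes as the element's tail. The sketch below tries that idea offline, using only the standard library's xml.etree.ElementTree rather than html5lib. It rests on some assumptions: SAMPLE is a simplified, well-formed copy of the listing shown earlier (the flag div is left out and a closing </ul> is added, since xml.etree only accepts well-formed markup), and extract_listings is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Assumption: a simplified, well-formed copy of the listing markup.
# The flag <div> is omitted and the closing </ul> is added.
SAMPLE = """<ul style="margin-left:0;" class="directory-url">
<li>
<a class="listinglink" href="http://www.absoluteastronomy.com/">Absolute Astronomy</a>
- Facts and statistical information about planets, moons, constellations, stars, galaxies, and Messier objects.
</li>
</ul>"""


def extract_listings(root):
    """Return (url, title, description) tuples for each listing link under root."""
    results = []
    # iter() also visits root itself, so this works when root is the <ul>
    for ul in root.iter("ul"):
        if "directory-url" not in ul.get("class", "").split():
            continue
        for a in ul.findall("li/a"):
            if "listinglink" in a.get("class", "").split():
                # tail is the text between </a> and the next tag; it may be None
                description = (a.tail or "").strip().strip("-").strip()
                results.append((a.get("href"), a.text, description))
    return results


listings = extract_listings(ET.fromstring(SAMPLE))
for url, title, description in listings:
    print(url)
    print("\t" + description)
```

Keeping the extraction in a function that takes an already-parsed tree means the same logic could be tried on the tree returned by html5lib.parse, although that combination is not tested here.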