Scraping web pages with python

From XPUB & Lens-Based wiki
See also: [[Filtering HTML with python]]
 
== Using html5lib + elementtree ==
 
Back in the day, working with HTML pages using python's standard library was often frustrating, as most web pages "in the wild" didn't conform to the rigid restrictions of XML. As a result, projects like Beautiful Soup were created that made working with HTML quite easy. Happily, the lessons learned from Beautiful Soup have been incorporated into modern libraries like html5lib. At the same time, some of the ugliness of working with XML via standard interfaces like SAX was improved by Fredrik Lundh's work on [http://effbot.org/zone/element-index.htm ElementTree], which is part of python's [https://docs.python.org/3.7/library/xml.etree.elementtree.html?highlight=elementtree standard library].
 
=== Find all the links (a) on the front page of nytimes.com and print their href and label ===
 
<source lang="python">
import html5lib
import xml.etree.ElementTree as ET
from urllib.request import urlopen
from urllib.parse import urljoin

url = "https://nytimes.com/"
with urlopen(url) as f:
    t = html5lib.parse(f, namespaceHTMLElements=False)

print("Link", "Label")
for a in t.findall('.//a[@href]'):
    # Absolutize any relative links with urljoin
    href = urljoin(url, a.attrib.get('href'))
    print(href, a.text)  # link, label
</source>
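Note that ''a.text'' only gives the text sitting directly inside the ''a'' element, so links whose label is wrapped in another tag (a span or a heading, say) may print as ''None''. A small variation on the loop above, a sketch rather than part of the original example, uses ''itertext()'' to gather all the nested text:

<source lang="python">
for a in t.findall('.//a[@href]'):
    href = urljoin(url, a.attrib.get('href'))
    # itertext() walks the whole subtree of the element, so nested labels are included
    label = "".join(a.itertext()).strip()
    print(href, label)
</source>
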
=== Print the contents of a document or particular tag ===


<source lang="python">
print(ET.tostring(sometag, encoding='unicode'))
</source>
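Here ''sometag'' stands for any element you have already picked out of the tree, for example with ''find''. A possible usage, assuming the ''t'' parsed above (the variable names are just for illustration):

<source lang="python">
# select the first paragraph of the parsed page and serialize it back to HTML
first_p = t.find('.//p')
if first_p is not None:
    print(ET.tostring(first_p, encoding='unicode'))
</source>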


=== Scraping from a local file ===
<source lang="python">
with open("myfile.html") as f:
    t = html5lib.parse(f, namespaceHTMLElements=False)
</source>
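While developing a scraper it can be useful to save a page to disk once and then parse the local copy, so every test run doesn't hit the live site. A minimal sketch (the filename is arbitrary):

<source lang="python">
from urllib.request import urlopen

url = "https://nytimes.com/"
with urlopen(url) as f:
    html = f.read()

# keep a local copy to parse on later runs
with open("myfile.html", "wb") as out:
    out.write(html)
</source>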


=== Generic page scraping ===
The ''.iter'' function lets you scan through all the elements on a page and run code on them to filter them however you want. ''.tag'' gives you the tag name (lowercase), and ''.text'' the text content of the tag.


<source lang="python">
import html5lib
import xml.etree.ElementTree as ET
from urllib.request import urlopen
from urllib.parse import urljoin

url = "https://nytimes.com/"
with urlopen(url) as f:
    t = html5lib.parse(f, namespaceHTMLElements=False)

for x in t.iter():
    if x.text is not None and "trump" in x.text.lower() and x.tag != "script":
        print(x.tag, x.text)
</source>
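''iter()'' also accepts a tag name, which saves the manual tag check when you only care about one kind of element. A sketch along the same lines (collecting headings is just an example):

<source lang="python">
# walk only the <h2> elements, e.g. to collect headlines
for h2 in t.iter('h2'):
    text = "".join(h2.itertext()).strip()
    if text:
        print(text)
</source>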


=== Setting the User Agent ===
Some web servers block bots by simply rejecting requests that don't identify themselves via the "User-Agent" HTTP header. This is easy enough to set (or "spoof").


See: https://stackoverflow.com/questions/24226781/changing-user-agent-in-python-3-for-urrlib-request-urlopen
 


<source lang="python">
import urllib.request

req = urllib.request.Request(
    "http://nytimes.com",
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    })
f = urllib.request.urlopen(req)

print(f.code)
</source>
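The same ''Request'' object can be handed to ''urlopen'' and the response passed straight to html5lib, so the earlier examples also work on servers that reject the default user agent. A sketch combining the two (any browser-like string will do for the header):

<source lang="python">
import html5lib
import urllib.request

req = urllib.request.Request(
    "http://nytimes.com",
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})  # any browser-like string

# parse the response exactly as before
with urllib.request.urlopen(req) as f:
    t = html5lib.parse(f, namespaceHTMLElements=False)

print(len(t.findall('.//a[@href]')), "links found")
</source>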


=== A spider ===

<source lang="python">
import html5lib, sys
import xml.etree.ElementTree as ET
from urllib.request import urlopen
from urllib.parse import urljoin
from urllib.error import HTTPError

url = 'https://news.bbc.co.uk'
todo = [url]
seen = set()
printed = set()

while todo:
    url = todo[0]
    todo = todo[1:]
    print('Scraping', url, file=sys.stderr)

    try:
        with urlopen(url) as f:
            t = html5lib.parse(f, namespaceHTMLElements=False)
            seen.add(url)

        for a in t.findall('.//a[@href]'):
            href = urljoin(url, a.attrib.get('href'))
            # print(ET.tostring(a, encoding='unicode'))
            if href not in printed:
                text = a.text or ''
                print(href, text.strip())  # link, label
                printed.add(href)
            if href not in seen:
                todo.append(href)
    except HTTPError:
        print('Page not found!!111', file=sys.stderr)
</source>
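As written, the spider follows links to any site it meets and can queue the same URL many times before it is first fetched. If you actually run it, you probably want to keep it on one domain and slow it down; a possible variation (the netloc check and the delay are additions, not part of the original):

<source lang="python">
import html5lib, sys, time
from urllib.request import urlopen
from urllib.parse import urljoin, urlparse
from urllib.error import HTTPError

start = 'https://news.bbc.co.uk'
domain = urlparse(start).netloc
todo = [start]
seen = set()

while todo:
    url = todo.pop(0)
    if url in seen:
        continue
    seen.add(url)  # mark before fetching so a URL is never fetched twice
    print('Scraping', url, file=sys.stderr)
    try:
        with urlopen(url) as f:
            t = html5lib.parse(f, namespaceHTMLElements=False)
    except HTTPError:
        print('Page not found!', file=sys.stderr)
        continue

    for a in t.findall('.//a[@href]'):
        href = urljoin(url, a.attrib.get('href'))
        # stay on the starting domain and skip anything already handled
        if urlparse(href).netloc == domain and href not in seen:
            todo.append(href)

    time.sleep(1)  # be polite: at most one request per second
</source>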


[[Category: Cookbook]] [[Category: Scraping]] [[Category: xpath]] [[Category: python]] [[Category: lxml]]
