Scraping web pages with python
See also: Filtering HTML with python
Using html5lib + elementtree
Back in the day, working with HTML pages using python's standard library was often frustrating because most web pages "in the wild" didn't conform to the rigid restrictions of XML. As a result, projects like Beautiful Soup were created that made working with HTML much easier. Happily, the lessons learned from Beautiful Soup have been incorporated into modern libraries like html5lib. At the same time, some of the ugliness of working with XML via standard interfaces like SAX was smoothed away by Fredrik Lundh's work on ElementTree, which is part of python's standard library.
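A minimal sketch of how the two fit together (the HTML snippet is just an illustration): html5lib does the forgiving parsing and hands back ordinary ElementTree elements that you can search with find and findall.

import html5lib

# html5lib copes happily with sloppy, unclosed markup
doc = html5lib.parse("<p>Hello <b>world", namespaceHTMLElements=False)

# The result is a plain ElementTree element, searchable with find/findall
b = doc.find('.//b')
print(b.tag, b.text)  # b world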
Find all the links (a) on the front page of nytimes.com and print their href and label
import html5lib
import xml.etree.ElementTree as ET
from urllib.request import urlopen
from urllib.parse import urljoin

url = "https://nytimes.com/"
with urlopen(url) as f:
    t = html5lib.parse(f, namespaceHTMLElements=False)

print("Link", "Label")
for a in t.findall('.//a[@href]'):
    # Absolutize any relative links with urljoin
    href = urljoin(url, a.attrib.get('href'))
    print(href, a.text)  # link, label
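ElementTree's findall only understands a small subset of XPath, but attribute tests are enough to narrow a search down. Continuing with the tree t from above, this sketch lists only links that carry a class attribute (which class values actually exist depends on the page you are scraping):

# Only links that declare a class attribute
for a in t.findall('.//a[@class]'):
    print(a.attrib.get('class'), a.attrib.get('href'))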
Print the contents of a document or particular tag
print(ET.tostring(sometag, encoding='unicode'))
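For example, you can grab a specific element first and then dump it back out as markup; a small sketch reusing the imports and the parsed tree t from the examples above:

# Serialize a single tag (here the page's <title>) back to markup
title = t.find('.//title')
if title is not None:
    print(ET.tostring(title, encoding='unicode'))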
Scraping from a local file
with open("myfile.html") as f:
    t = html5lib.parse(f, namespaceHTMLElements=False)
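To avoid hammering the server while you experiment, one approach (a sketch; the filename is just an example) is to download the page once, save it, and then parse the local copy as shown above:

from urllib.request import urlopen

# Download once and keep a local copy to work on
with urlopen("https://nytimes.com/") as f:
    with open("myfile.html", "wb") as out:
        out.write(f.read())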
Generic page scraping
The .iter method lets you scan through every element in the parsed page and run whatever filtering code you want on each one. The .tag attribute gives you access to the tag name (lowercase), and .text to the text content of the tag.
import html5lib
import xml.etree.ElementTree as ET
from urllib.request import urlopen
from urllib.parse import urljoin

url = "https://nytimes.com/"
with urlopen(url) as f:
    t = html5lib.parse(f, namespaceHTMLElements=False)

# Walk every element and print the ones whose text mentions "trump"
# (skipping script tags, whose text is javascript rather than content)
for x in t.iter():
    if x.text is not None and "trump" in x.text.lower() and x.tag != "script":
        print(x.tag, x.text)
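One caveat: .text only contains the text that appears before an element's first child, so anything nested inside (a headline with an <em> in the middle, say) gets cut off. The itertext method collects all the nested text; a sketch reusing the tree t from above:

# Gather the complete text of every h1/h2/h3, including nested tags
for x in t.iter():
    if x.tag in ("h1", "h2", "h3"):
        full_text = "".join(x.itertext()).strip()
        if full_text:
            print(x.tag, full_text)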
Setting the User Agent
Some web servers block bots simply by rejecting requests that don't identify themselves as a browser via the "User-Agent" HTTP header. This header is easy enough to set (aka "spoof"). See: https://stackoverflow.com/questions/24226781/changing-user-agent-in-python-3-for-urrlib-request-urlopen
import urllib.request

req = urllib.request.Request(
    "http://nytimes.com",
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    })
f = urllib.request.urlopen(req)
print(f.code)
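The response object returned by urlopen works the same whether or not you built a Request by hand, so it can be fed straight into html5lib as in the earlier examples; a sketch combining the two:

import html5lib
import urllib.request

ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
req = urllib.request.Request("http://nytimes.com", headers={'User-Agent': ua})

with urllib.request.urlopen(req) as f:
    t = html5lib.parse(f, namespaceHTMLElements=False)

title = t.find('.//title')
print(title.text if title is not None else 'no title found')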
A spider
import html5lib, sys
import xml.etree.ElementTree as ET
from urllib.request import urlopen
from urllib.parse import urljoin
from urllib.error import HTTPError, URLError

url = 'https://news.bbc.co.uk'
todo = [url]     # pages still to visit
seen = set()     # pages already fetched
printed = set()  # links already printed

while todo:
    url = todo[0]
    todo = todo[1:]
    if url in seen:
        continue  # the same link can be queued more than once
    print('Scraping', url, file=sys.stderr)
    try:
        with urlopen(url) as f:
            t = html5lib.parse(f, namespaceHTMLElements=False)
        seen.add(url)
        for a in t.findall('.//a[@href]'):
            href = urljoin(url, a.attrib.get('href'))
            # print(ET.tostring(a, encoding='unicode'))
            if href not in printed:
                text = a.text or ''
                print(href, text.strip())  # link, label
                printed.add(href)
            if href not in seen:
                todo.append(href)
    except (HTTPError, URLError):
        # URLError also covers mailto:, javascript: and other unfetchable links
        print('Page not found!!111', file=sys.stderr)
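As written, the spider will happily wander off to every external site it links to. A sketch of one way to keep it on a single site, using urlparse to compare hostnames before queueing a link (the should_follow helper is made up for illustration):

from urllib.parse import urlparse

start = 'https://news.bbc.co.uk'
allowed_host = urlparse(start).netloc

def should_follow(href):
    # Only follow http(s) links that stay on the starting host
    parts = urlparse(href)
    return parts.scheme in ('http', 'https') and parts.netloc == allowed_host

# then, inside the spider's link loop:
#     if should_follow(href) and href not in seen:
#         todo.append(href)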