Scraping
Revision as of 14:34, 25 May 2022
Scraping (also screen scraping) is the process of extracting data from output meant for human readers, such as a web page, rather than from an interface designed for programs.
In the course, we have used the library BeautifulSoup to manipulate HTML pages in Python.
Other interesting libraries to consider:
- Mechanize, which simulates a browser in Python and can "remember" state (such as cookies and sessions) between pages
- lxml, which can deal with malformed HTML and quickly convert it to XML trees
- html5lib, a pure-Python parser that follows the HTML5 parsing rules used by browsers
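To make concrete what such libraries do with sloppy markup, here is a minimal sketch that collects link targets from a deliberately broken fragment. It uses only the standard library's html.parser as a stand-in, not any of the libraries listed above; the class name LinkCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Deliberately sloppy markup: unclosed <p> and <li>, an uppercase tag, an unquoted attribute.
SAMPLE = '<p>Intro<ul><li><A HREF=one.html>One<li><a href="two.html">Two</a></ul>'

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.links)
```

html.parser only reports tags as it meets them and does not build a tree, so for anything beyond simple extraction one of the libraries above is probably the better fit.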
Resources:
- http://us.pycon.org/2009/tutorials/schedule/2AM8/
- http://scrapy.org/ Python framework for custom scrapers
See Extracting parts of an HTML document and other recipes in the Category:Cookbook
Scraping HTML in Python with html5lib + CSS selectors (2022)
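from xml.etree import ElementTree as ET
import html5lib

def textContent(elt):
    # Return the text of elt and all its descendants, like the DOM's textContent.
    if elt.text is not None:
        ret = elt.text
    else:
        ret = ""
    # method="text" serialises each child's text followed by its tail.
    return ret + "".join(ET.tostring(x, method="text", encoding="unicode") for x in elt)

with open("index.html") as fin:
    # namespaceHTMLElements=False keeps tag names plain ("p" rather than "{namespace}p").
    t = html5lib.parse(fin, namespaceHTMLElements=False)

from cssselect import HTMLTranslator
import elementpath

def querySelector(t, selector):
    # Return the first element matching the CSS selector, or None if there is none.
    for elt in elementpath.select(t, HTMLTranslator().css_to_xpath(selector)):
        return elt

def querySelectorAll(t, selector):
    # Yield every element matching the CSS selector.
    for elt in elementpath.select(t, HTMLTranslator().css_to_xpath(selector)):
        yield elt

# Pass the parsed tree t as the first argument.
for p in querySelectorAll(t, "p.test"):
    print(textContent(p))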
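To see what textContent returns, it can be tried on a small tree built in memory. The sketch below uses only xml.etree from the standard library, so html5lib, cssselect and elementpath are not needed to run it. The sample markup is invented, and ElementTree's built-in path syntax stands in for the CSS selector: it is an assumption-laden substitute that matches the class attribute exactly, whereas p.test would also match class="test other".

```python
from xml.etree import ElementTree as ET


def textContent(elt):
    # Text directly inside elt, then each child's text followed by its tail.
    ret = elt.text if elt.text is not None else ""
    return ret + "".join(ET.tostring(x, method="text", encoding="unicode") for x in elt)


# Invented sample; it is well-formed, so ET.fromstring can parse it without html5lib.
SAMPLE = """<html><body>
<p class="test">Hello <b>bold</b> world</p>
<p>skipped</p>
<p class="test">Second <i>one</i></p>
</body></html>"""

t = ET.fromstring(SAMPLE)

# ElementTree's limited XPath: exact attribute match only.
texts = [textContent(p) for p in t.iterfind(".//p[@class='test']")]
print(texts)
```

The text after the closing b tag (" world") is stored as that child's tail, which is why serialising each child with method="text" picks it up.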