Scraping

Scraping (also Screen Scraping) is the process of extracting data out of something.

In the course, we have used the library [[BeautifulSoup]] to manipulate HTML pages in [[Python]].
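
A minimal sketch of this kind of BeautifulSoup usage (the file name and the link-listing task are only an illustration, not part of the course material):

<source lang="python">
from bs4 import BeautifulSoup

# Parse an HTML page that was already saved to disk ("index.html" is a placeholder)
with open("index.html") as fin:
    soup = BeautifulSoup(fin, "html.parser")

# List the target and text of every link on the page
for a in soup.find_all("a"):
    print(a.get("href"), a.get_text())
</source>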


Other interesting libraries to consider:
* [http://wwwsearch.sourceforge.net/mechanize/ Mechanize] in essence simulates a browser in Python, that can "remember" things (like cookies / sessions) between pages
* [http://codespeak.net/lxml/ lxml] which can apparently deal with "mal-formed" HTML and quickly convert it to xml trees (see the sketch after this list)
* [http://code.google.com/p/html5lib/ html5lib]
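
For example, a minimal sketch (not from the original page) of how lxml can repair mal-formed markup into an element tree that can then be queried with XPath:

<source lang="python">
from lxml import etree

# A small, deliberately broken snippet of HTML
broken = "<p>unclosed paragraph <b>bold text"

# etree.HTML() uses lxml's forgiving HTML parser: it closes the open tags
# and wraps the fragment in <html><body>...</body></html>
root = etree.HTML(broken)
print(etree.tostring(root, pretty_print=True).decode())

# Once it is a regular element tree, XPath queries work as usual
print(root.xpath("//b/text()"))   # ['bold text']
</source>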
Resources:
* http://us.pycon.org/2009/tutorials/schedule/2AM8/
* [http://scrapy.org/ Scrapy], a Python framework for custom scrapers (see the minimal spider sketch below)
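
As an illustration of what such a custom scraper can look like, here is a minimal, hypothetical Scrapy spider (the site and field names are only an example, not part of the original page):

<source lang="python">
import scrapy

class QuotesSpider(scrapy.Spider):
    # run with: scrapy runspider quotes_spider.py -o quotes.json
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]  # a public practice site for scraping

    def parse(self, response):
        # Scrapy response objects have CSS selectors built in
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
</source>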
See [[Extracting parts of an HTML document]] and other recipes in the [[:Category:Cookbook]]
== Scraping HTML in Python with html5lib + css selectors (2022) ==
html5lib is a modern Python library for parsing "HTML in the wild": it handles just about any HTML content and will correct broken tags, etc., so that whatever you load is "well-formed". Alternatively, you can use a library like BeautifulSoup.
    pip install html5lib elementpath cssselect
<source lang="python">
from xml.etree import ElementTree as ET
import html5lib

def textContent (elt):
    # Roughly the DOM textContent: the element's own text plus the text
    # (and tails) of all of its descendants, concatenated.
    if elt.text is not None:
        ret = elt.text
    else:
        ret = u""
    return ret + u"".join([ET.tostring(x, method="text", encoding="utf8").decode("utf-8") for x in elt])

with open("index.html") as fin:
    # namespaceHTMLElements=False keeps tag names plain ("p" rather than
    # "{http://www.w3.org/1999/xhtml}p"), which makes find/XPath queries simpler.
    t = html5lib.parse(fin, namespaceHTMLElements=False)

# Print the text of the first <p> in the document
print (textContent(t.find(".//p")))
</source>
<source lang="python">
from cssselect import HTMLTranslator
import elementpath

# css_to_xpath translates a CSS selector into an XPath expression,
# which elementpath can then evaluate against the html5lib tree.

def querySelector (t, selector):
    # Return the first matching element (or None if there is no match)
    for elt in elementpath.select(t, HTMLTranslator().css_to_xpath(selector)):
        return elt

def querySelectorAll (t, selector):
    # Yield every matching element
    for elt in elementpath.select(t, HTMLTranslator().css_to_xpath(selector)):
        yield elt

# Reuses t and textContent from the previous snippet
for p in querySelectorAll(t, "p.test"):
    print (textContent(p))
</source>
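
A short usage sketch building on the helpers above (assuming the same parsed tree <code>t</code>), for instance to collect the target and text of every link on the page:

<source lang="python">
for a in querySelectorAll(t, "a[href]"):
    print (a.get("href"), textContent(a))
</source>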
