Calendars:Networked Media Calendar/Networked Media Calendar/16-03-2011 - Event 1

11-18 | Nicolas Maleve - Thematic Project
= Cookbook Recipes for Goodiff Workshop =
=== Simplifying HTML by removing "invisible" parts (lxml) ===
Use lxml's clean_html to simplify an HTML page by stripping "invisible" parts such as <script> elements and comments:
<source lang="python">
import lxml.html.clean
lxml.html.clean.clean_html(source)
</source>
example:
lxml.html.clean.clean_html("<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>")
result:
'<div>Hello<body><p>This is <u>some crazy text</u>. OK!</p></body></div>'
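In practice you would run this on a page fetched from the web; a minimal sketch using Python 2's urllib2 (the URL is just a placeholder):
<source lang="python">
import urllib2
import lxml.html.clean

# fetch a page and clean it (example.com is a placeholder URL)
source = urllib2.urlopen("http://example.com/").read()
print lxml.html.clean.clean_html(source)
</source>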
=== Stripping all the tags from HTML to get pure text (nltk) ===
You can use nltk.util.clean_html to remove all tags. (Note that this function was removed from later NLTK releases; an lxml-based alternative is sketched below.)
<source lang="python">
import nltk.util
nltk.util.clean_html(source)
</source>
example:
nltk.util.clean_html("<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>")
result:
'Hello This is some crazy text . OK!'
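Since clean_html is gone from recent NLTK releases, a rough equivalent can be built from lxml alone: first drop the scripts with lxml's clean_html, then flatten the remaining markup with text_content. A sketch:
<source lang="python">
import lxml.html
import lxml.html.clean

source = "<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>"
cleaned = lxml.html.clean.clean_html(source)        # drops the <script> element
print lxml.html.fromstring(cleaned).text_content()  # strips all remaining tags
</source>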
=== Looking up synonym-sets for a word (wordnet) ===
<source lang="python">
from nltk.corpus import wordnet
meanings = wordnet.synsets('woman')
for m in meanings:
    # name/definition/examples are attributes in NLTK 2.x;
    # in NLTK 3+ they became methods: m.name(), m.definition(), m.examples()
    print "===", m.name, "==="
    print m.definition
    print "\t* ".join(m.examples)
</source>
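The synonyms themselves live in each synset's lemmas; sticking to the NLTK 2.x attribute style used above, you can list them like this:
<source lang="python">
from nltk.corpus import wordnet
for m in wordnet.synsets('woman'):
    # lemma_names is an attribute in NLTK 2.x, a method in NLTK 3+
    print m.name, "->", ", ".join(m.lemma_names)
</source>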
=== Splitting text into sentences (nltk) ===
<source lang="python">
from nltk.tokenize import sent_tokenize
print sent_tokenize("I read J.D. Salinger in High School. He wrote 'Catcher in the Rye'.")
</source>
['I read J.D.', 'Salinger in High School.', "He wrote 'Catcher in the Rye'."]
As you can see, the result is not perfect: the abbreviation "J.D." is mistaken for a sentence boundary.
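That happens because the default Punkt tokenizer does not know "J.D." is an abbreviation; you can tell it yourself. A minimal sketch:
<source lang="python">
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

params = PunktParameters()
# abbreviations are given lowercased and without the final period
params.abbrev_types = set(['j.d'])
tokenizer = PunktSentenceTokenizer(params)
print tokenizer.tokenize("I read J.D. Salinger in High School. He wrote 'Catcher in the Rye'.")
</source>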
=== Removing common words / stopwords (nltk) ===
<source lang="python">
from nltk.corpus import stopwords
english_stops = set(stopwords.words("english"))
words = "Stopwords are common words that are often handy to remove or ignore when processing text".split()
words = [w for w in words if w not in english_stops]
print words
</source>
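Note that the stopword list is all lowercase, so a capitalized stopword at the start of a sentence slips through the test above; lowercasing each word before comparing avoids that:
<source lang="python">
from nltk.corpus import stopwords

english_stops = set(stopwords.words("english"))
words = "The cat sat on the mat".split()
# compare the lowercased word, so 'The' is filtered out along with 'the'
print [w for w in words if w.lower() not in english_stops]
</source>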
=== Finding capitalized words (regex) ===
<source lang="python">
import re
# \b[A-Z]+\b matches words written entirely in capitals (e.g. acronyms)
text = "NLTK and HTML are acronyms; Hello is merely capitalized."
pat = re.compile(r"\b[A-Z]+\b")
print pat.findall(text)
</source>
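If you want words that merely start with a capital letter, rather than all-caps words, a slightly different pattern does the job:
<source lang="python">
import re
text = "NLTK and HTML are acronyms; Hello is merely capitalized."
pat = re.compile(r"\b[A-Z][a-z]+\b")
print pat.findall(text)  # ['Hello']
</source>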
=== Extracting parts of an HTML document ===
The html5lib parser turns the source text of an HTML page into a
structured object, which then lets you use CSS selectors or XPath
expressions to select/extract portions of the page.
Using XPath expressions:
<source lang="python">
import html5lib, lxml
htmlsource="<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
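# note: XPath positions count from 1, so p[2] selects the second <p>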
p = page.xpath("/html/body/p[2]")
if p:
    p = p[0]
    print "".join([t for t in p.itertext()])
</source>
outputs:
More stuff.
CSS selectors are also possible:
<source lang="python">
import html5lib, lxml, lxml.cssselect
htmlsource="<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
selector = lxml.cssselect.CSSSelector("p")
for p in selector(page):
    print "-"*20
    print "".join([t for t in p.itertext()])
</source>
--------------------
Example page.
--------------------
More stuff.
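Under the hood lxml compiles the CSS selector into an XPath expression; the compiled expression is available as the selector's path attribute:
<source lang="python">
import lxml.cssselect
selector = lxml.cssselect.CSSSelector("p")
print selector.path  # something like: descendant-or-self::p
</source>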
=== Working with lxml ===
==== Extracting the text contents of a node (lxml) ====
The itertext method of a node iterates over all the text it contains, including the text of descendant elements.
<source lang="python">
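# 'node' is any lxml element, e.g. one returned by xpath() or a CSSSelector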
for t in node.itertext():
    print t
</source>
<source lang="python">
text = "".join(node.itertext())
</source>
==== Turning part of a page back into code (aka serialization) (lxml) ====
Imagine you want to print out the full code of part of a page.
Use lxml.etree.tostring. This converts any node back into source code -- a process called serialization.
<source lang="python">
import html5lib, lxml.cssselect, lxml.etree

htmlsource="<html><body><p>Example page.</p><p>More stuff with <i>markup</i>.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
selector = lxml.cssselect.CSSSelector("p")
p = selector(page)[1]
print lxml.etree.tostring(p)
</source>
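Continuing from the example above, tostring also accepts some handy keyword arguments:
<source lang="python">
print lxml.etree.tostring(p, pretty_print=True)  # indented source code
print lxml.etree.tostring(p, method="text")      # only the text content, no tags
</source>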
