Calendars:Networked Media Calendar/Networked Media Calendar/16-03-2011 -Event 1: Difference between revisions
Renee Turner (talk | contribs)
11-18 | Nicolas Maleve - Thematic Project
= Cookbook Recipes for Goodiff Workshop =

=== Simplifying HTML by removing "invisible" parts (lxml) ===

Use lxml to simplify an HTML page:

<source lang="python">
import lxml.html.clean
lxml.html.clean.clean_html(source)
</source>

example:
lxml.html.clean.clean_html("<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>")
result:
'<div>Hello<body><p>This is <u>some crazy text</u>. OK!</p></body></div>'
=== Stripping all the tags from HTML to get pure text (nltk) ===

You can use nltk.util.clean_html to remove all tags (note: this function was removed in later NLTK releases, which recommend a dedicated HTML parser such as BeautifulSoup instead):

<source lang="python">
import nltk.util
nltk.util.clean_html(source)
</source>

example:
nltk.util.clean_html("<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>")
result:
'Hello This is some crazy text . OK!'
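If nltk is not at hand, a rough approximation is possible with only the standard library. This is a sketch using regular expressions; it is not robust against all real-world HTML (a real parser is safer), but it handles simple pages like the example above:

<source lang="python">
import re

def strip_tags(source):
    # Drop <script>/<style> elements together with their contents,
    # then remove the remaining tags and collapse whitespace.
    source = re.sub(r"(?s)<(script|style)\b.*?</\1>", " ", source)
    source = re.sub(r"<[^>]+>", " ", source)
    return " ".join(source.split())

print(strip_tags("<html><head><title>Hello</title>"
                 "<script>var foo=3;</script></head>"
                 "<body><p>This is <u>some crazy text</u>. OK!</body></html>"))
# prints: Hello This is some crazy text . OK!
</source>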
=== Looking up synonym-sets for a word (wordnet) ===

<source lang="python">
from nltk.corpus import wordnet
meanings = wordnet.synsets('woman')
for m in meanings:
    print "===", m.name, "==="
    print m.definition
    print "\t* ".join(m.examples)
</source>
=== Splitting text into sentences (nltk) ===

<source lang="python">
from nltk.tokenize import sent_tokenize
print sent_tokenize("I read J.D. Salinger in High School. He wrote 'Catcher in the Rye'.")
</source>

result:
['I read J.D.', 'Salinger in High School.', "He wrote 'Catcher in the Rye'."]

As you can see, it's not perfect: the abbreviation "J.D." is mistaken for the end of a sentence.
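For comparison, a naive splitter built on the standard library's re module runs into exactly the same abbreviation problem; this is only a sketch, not a substitute for a trained tokenizer:

<source lang="python">
import re

def naive_sent_tokenize(text):
    # Split after ., ! or ? followed by whitespace -- too eager for
    # abbreviations like "J.D.".
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sent_tokenize("I read J.D. Salinger in High School. He wrote 'Catcher in the Rye'."))
# prints the same flawed split as sent_tokenize above
</source>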
=== Removing common words / stopwords (nltk) ===

<source lang="python">
from nltk.corpus import stopwords
english_stops = set(stopwords.words("english"))
words = "Stopwords are common words that are often handy to remove or ignore when processing text".split()
words = [w for w in words if w not in english_stops]
print words
</source>
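The same filtering works with any hand-made stopword set using only built-ins; the set below is a tiny made-up sample, not NLTK's actual list:

<source lang="python">
english_stops = set("a an the are that to or when is of".split())
words = "Stopwords are common words that are often handy to remove or ignore when processing text".split()
# Lowercase each word before the lookup so "The" matches "the".
filtered = [w for w in words if w.lower() not in english_stops]
print(filtered)
</source>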
=== Finding capitalized words (regex) ===

Note that this pattern matches words written entirely in capital letters; to find words that merely start with a capital, use a pattern like \b[A-Z][a-z]+\b instead. Here text is any string you have already loaded:

<source lang="python">
import re
pat = re.compile(r"\b[A-Z]+\b")
print pat.findall(text)
</source>
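To illustrate the difference between matching all-capitals words and words that merely begin with a capital, a small sketch with a made-up sample string:

<source lang="python">
import re

text = "NASA sent Armstrong to the Moon in JULY of 1969"

# Entirely-uppercase words (acronyms, shouting):
print(re.findall(r"\b[A-Z]+\b", text))       # ['NASA', 'JULY']
# Words that start with a capital and continue in lowercase:
print(re.findall(r"\b[A-Z][a-z]+\b", text))  # ['Armstrong', 'Moon']
</source>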
=== Extracting parts of an HTML document ===

The html5lib parser turns the source text of an HTML page into a structured object, which then allows you, for instance, to use CSS selectors or XPath expressions to select/extract portions of a page.

You can use XPath expressions:

<source lang="python">
import html5lib, lxml
htmlsource = "<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
p = page.xpath("/html/body/p[2]")
if p:
    p = p[0]
    print "".join([t for t in p.itertext()])
</source>

outputs:
More stuff.
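For well-formed snippets like this one, the standard library's xml.etree.ElementTree supports a limited XPath subset, enough for simple positional selection (html5lib is still needed for real-world, messy HTML); a sketch:

<source lang="python">
import xml.etree.ElementTree as ET

htmlsource = "<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
page = ET.fromstring(htmlsource)

# ElementTree's find() understands simple paths with 1-based positions.
p = page.find("body/p[2]")
if p is not None:
    print("".join(p.itertext()))  # prints: More stuff.
</source>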
CSS selectors are also possible:

<source lang="python">
import html5lib, lxml, lxml.cssselect
htmlsource = "<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
selector = lxml.cssselect.CSSSelector("p")
for p in selector(page):
    print "-"*20
    print "".join([t for t in p.itertext()])
</source>

outputs:
--------------------
Example page.
--------------------
More stuff.
=== Working with lxml ===

==== Extracting the text contents of a node (lxml) ====

The itertext method of a node can be useful: it yields every piece of text contained in the node, in document order.

<source lang="python">
for t in node.itertext():
    print t
</source>

To collect the pieces into a single string:

<source lang="python">
text = "".join(node.itertext())
</source>
==== Turning part of a page back into code (aka serialization) (lxml) ====

Imagine you want to print out the full code of part of a page. Use lxml.etree.tostring. This converts any node back into source code -- a process called serialization.

<source lang="python">
import html5lib, lxml.etree, lxml.cssselect
htmlsource = "<html><body><p>Example page.</p><p>More stuff with <i>markup</i>.</p></body></html>"
htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
page = htmlparser.parse(htmlsource)
selector = lxml.cssselect.CSSSelector("p")
p = selector(page)[1]
print lxml.etree.tostring(p)
</source>
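The standard library's ElementTree offers the same serialization step via ET.tostring, shown here on a small well-formed snippet:

<source lang="python">
import xml.etree.ElementTree as ET

page = ET.fromstring("<html><body><p>Example page.</p>"
                     "<p>More stuff with <i>markup</i>.</p></body></html>")
p = page.findall(".//p")[1]

# tostring() serializes the node (and its children) back into markup;
# encoding="unicode" returns a string rather than bytes.
print(ET.tostring(p, encoding="unicode"))
# prints: <p>More stuff with <i>markup</i>.</p>
</source>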
Revision as of 11:59, 16 March 2011