Simplifying HTML by removing "invisible" parts

From Media Design: Networked & Lens-Based wiki
Revision as of 12:36, 16 March 2011 by Aymeric Mansoux

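Use lxml's lxml.html.clean module to simplify an HTML page. Its clean_html function removes the parts a reader never sees, such as scripts, and returns the cleaned markup. In lxml 5.2 and later this module is distributed separately as the lxml_html_clean package.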

import lxml.html.clean
# source holds the HTML as a string; the cleaned markup is returned as a string
cleaned = lxml.html.clean.clean_html(source)

Example:

lxml.html.clean.clean_html("<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>")

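Result:

'Hello<body>This is some crazy text. OK!</body>'

The <script> element and its content are gone, and the <html>, <head> and <title> tags are dropped while the title text "Hello" is kept. Inline tags such as <p> and <u> may be missing from the string above because the wiki rendered them instead of displaying them.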
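To make the idea of "invisible" parts concrete, here is a minimal sketch that uses only html.parser from the Python standard library. It is not how lxml implements clean_html: the names VisibleTextExtractor, visible_text and the HIDDEN set are hypothetical, chosen for illustration. It also behaves differently, since it returns plain text rather than cleaned HTML and it drops the whole <head>, including the title text.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text outside of elements that a browser does not display."""

    # Assumed, non-exhaustive set of elements whose content is never shown.
    HIDDEN = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []  # visible text fragments, in document order
        self.depth = 0    # how many hidden elements we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when no hidden element is open.
        if self.depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the text of `html` with hidden elements and all tags removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks).strip()


sample = (
    "<html><head><title>Hello</title><script>var foo=3;</script></head>"
    "<body><p>This is <u>some crazy text</u>. OK!</body></html>"
)
print(visible_text(sample))  # This is some crazy text. OK!
```

The depth counter is a simplification: it assumes every hidden element is properly closed, so badly broken markup (for example an unclosed <head>) would hide more than intended. For real pages, the lxml approach above is the more robust choice.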