Simplifying HTML by removing "invisible" parts

From XPUB & Lens-Based wiki
Revision as of 12:36, 16 March 2011 by Aymeric Mansoux

Use lxml's lxml.html.clean module to simplify an HTML page by stripping parts that are never displayed, such as scripts.

import lxml.html.clean

# source is a string containing the HTML to simplify
cleaned = lxml.html.clean.clean_html(source)

Example:

lxml.html.clean.clean_html("<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>")

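Result:

'Hello<body>This is some crazy text. OK!</body>'

The script is gone, while the title text is kept. The wiki appears to have rendered some of the returned tags (such as <p> and <u>) instead of displaying them, so the actual return value likely contains more markup than shown here, and the exact string may vary with the lxml version.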
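The call above relies on the cleaner's default settings. As a minimal sketch of how the same module can be tuned, the example below builds a Cleaner instance with explicit options. The sample page, the SAMPLE constant, and the simplify helper are hypothetical names introduced for illustration. Recent lxml releases are understood to ship this module as the separate lxml_html_clean package, so the import may need that package installed.

```python
from lxml.html.clean import Cleaner  # newer lxml may require the lxml_html_clean package

# Hypothetical sample page: the example above, extended with a <style>
# block and an HTML comment so that each option has something to remove.
SAMPLE = (
    "<html><head><title>Hello</title>"
    "<script>var foo=3;</script>"
    "<style>p { color: red; }</style></head>"
    "<body><!-- internal note -->"
    "<p>This is <u>some crazy text</u>. OK!</p></body></html>"
)


def simplify(html):
    """Return the HTML string with scripts, styles and comments removed."""
    cleaner = Cleaner(
        scripts=True,    # remove <script> elements (already the default)
        comments=True,   # remove HTML comments (already the default)
        style=True,      # also remove <style> elements (off by default)
    )
    return cleaner.clean_html(html)


cleaned = simplify(SAMPLE)
print(cleaned)
```

Options that are not passed keep their defaults, so the remaining behaviour should match clean_html as used above.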