Simplifying HTML by removing "invisible" parts

From XPUB & Lens-Based wiki

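Use lxml to simplify an HTML page. Since lxml 5.2 the cleaner lives in the separate lxml_html_clean package, which must be installed for the import below to work.

import lxml.html.clean
# source is a string containing the HTML to clean
lxml.html.clean.clean_html(source)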

example: lxml.html.clean.clean_html("<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>")

result:

'Hello<body>This is some crazy text. OK!</body>'
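To make the idea of "invisible" parts concrete, here is a minimal sketch that does not use lxml at all. It uses Python's standard html.parser instead, and only illustrates the principle: keep the text nodes, skip whatever sits inside script or style. The names INVISIBLE, VisibleText and visible_text are made up for this example, and the list of skipped tags is an assumption, not lxml's actual rule set.

```python
from html.parser import HTMLParser

# Tags whose contents never show up as page text (an assumed list for this sketch).
INVISIBLE = {"script", "style"}


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything nested inside an INVISIBLE tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many INVISIBLE tags we are currently inside
        self.chunks = []  # visible text pieces, in document order

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in INVISIBLE and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script/style element.
        if self.depth == 0:
            self.chunks.append(data)


def visible_text(source):
    """Return the concatenated visible text of an HTML string."""
    parser = VisibleText()
    parser.feed(source)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


page = ("<html><head><title>Hello</title><script>var foo=3;</script></head>"
        "<body><p>This is <u>some crazy text</u>. OK!</body></html>")
print(visible_text(page))  # HelloThis is some crazy text. OK!
```

Unlike clean_html, this throws away all markup and returns plain text, so it is only useful when the tags themselves are not needed.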