Stripping all the tags from HTML to get pure text

From XPUB & Lens-Based wiki

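In NLTK 2.x you can use nltk.util.clean_html to remove all tags from an HTML string and get back the plain text. In NLTK 3.0 and later the function is no longer implemented: calling it raises NotImplementedError with a message recommending BeautifulSoup's get_text() instead.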

import nltk.util
# source is a string containing the HTML markup
nltk.util.clean_html(source)

Example:

nltk.util.clean_html("<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>")

Result:

'Hello This is some crazy text . OK!'
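If nltk.util.clean_html is not available in your NLTK version, a similar result can be sketched with Python's standard library alone. The code below is not NLTK's implementation. It is a minimal illustration built on html.parser.HTMLParser, and the names TextExtractor and strip_tags are made up for this example. It assumes you want to drop the contents of script and style elements and collapse runs of whitespace.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # elements whose text should not be kept

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []       # text fragments in document order
        self.skip_depth = 0    # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_tags(source):
    """Return the text of an HTML string with tags removed (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    # Join fragments with a space, then collapse all whitespace runs.
    return " ".join(" ".join(parser.chunks).split())


html = ("<html><head><title>Hello</title><script>var foo=3;</script></head>"
        "<body><p>This is <u>some crazy text</u>. OK!</body></html>")
print(strip_tags(html))
```

For the example string above this prints Hello This is some crazy text . OK! which matches the clean_html result, including the space before the full stop, because every text fragment is joined with a single space.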