Stripping all the tags from HTML to get pure text
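In NLTK 2.x you can use nltk.util.clean_html to remove all tags (the function was dropped in NLTK 3.0, which points users to BeautifulSoup's get_text() instead):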
import nltk.util
nltk.util.clean_html(source)
Example:
nltk.util.clean_html("<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>")
Result: 'Hello This is some crazy text . OK!'
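If NLTK's helper is not available in your installation, a similar result can be approximated with the standard library alone. The following is only a minimal sketch built on html.parser; the names _TextExtractor and strip_tags are made up for this illustration and are not part of NLTK. It assumes you want to drop the contents of script and style elements and to collapse runs of whitespace.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space (so every tag boundary becomes a space),
        # then collapse whitespace runs and trim the ends.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def strip_tags(source):
    """Return the visible text of an HTML string (illustrative, not NLTK's code)."""
    parser = _TextExtractor()
    parser.feed(source)
    parser.close()
    return parser.text()


html = ("<html><head><title>Hello</title><script>var foo=3;</script></head>"
        "<body><p>This is <u>some crazy text</u>. OK!</body></html>")
print(strip_tags(html))
```

For the example input this should print the same string as the result shown above, including the space before the period, because each tag boundary is turned into a space. For messy real-world pages a dedicated parser such as BeautifulSoup is likely the more robust choice.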