Stripping all the tags from HTML to get pure text

From XPUB & Lens-Based wiki
Revision as of 12:39, 16 March 2011 by Aymeric Mansoux

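You can use nltk.util.clean_html to remove all tags from an HTML string and keep only the text. This function exists in NLTK 2.x; from NLTK 3.0 onward, calling it raises NotImplementedError and points to BeautifulSoup's get_text() instead.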

import nltk.util
# source is a string containing the HTML markup
text = nltk.util.clean_html(source)

Example:

nltk.util.clean_html("<html><head><title>Hello</title><script>var foo=3;</script></head><body><p>This is <u>some crazy text</u>. OK!</body></html>")

Result:

'Hello This is some crazy text . OK!'
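
For setups where nltk.util.clean_html is not available, a similar result can be sketched with Python's standard html.parser module. The names TextExtractor and strip_tags below are made up for illustration and are not part of NLTK. The sketch skips the contents of script and style elements and collapses whitespace; that is an assumption based on the example output above, not a reimplementation of clean_html.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, ignoring <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_tags(source):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    # Join the text pieces with spaces, then collapse runs of whitespace
    return " ".join(" ".join(parser.chunks).split())


html = ("<html><head><title>Hello</title><script>var foo=3;</script></head>"
        "<body><p>This is <u>some crazy text</u>. OK!</body></html>")
print(strip_tags(html))  # Hello This is some crazy text . OK!
```

Joining the pieces with a space keeps words from adjacent elements apart, at the cost of a stray space before punctuation that follows a closing tag, as in "text . OK!".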