Syllabus 20100223
Revision as of 20:33, 23 September 2010 by Migratebot
Regular Expressions
The Python documentation of the "re" module: http://docs.python.org/library/re.html
Visualizing regular expressions: http://osteele.com/tools/reanimator/
Syntax
Python raw strings
Two ways to use each function (directly on the module, or via a "compiled" regex object)
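A minimal sketch of both points, raw strings and the two calling styles. The sample text and variable names are made up for illustration.

```python
import re

# A raw string keeps the backslash, so the regex engine sees \b (word boundary).
# Without the r prefix, "\b" is a single backspace character.
print(len("\b"), len(r"\b"))  # 1 2

text = "Room 101 and room 2048"  # made-up sample text

# Way 1: call the module-level function and pass the pattern each time.
direct = re.findall(r"\b\d+\b", text)

# Way 2: compile the pattern once, then call the same method on the object.
number = re.compile(r"\b\d+\b")
compiled = number.findall(text)

print(direct)    # ['101', '2048']
print(compiled)  # ['101', '2048']
```

Compiling is mostly a convenience when the same pattern is reused many times; both styles give the same results.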
- Characters
- Character Classes ([], -)
- \ for literals
- \w \d \s (\W \S \D) Pre-defined char classes
- Manyness (+*?{})
- Limiting "greediness" with an extra ?
- Anchors (^ $ \b)
- Grouping (), (?:) (?P<name>)
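The syntax elements listed above could be tried out on one line of text, roughly as follows. The sample line and the variable names are invented for this sketch.

```python
import re

# Made-up sample line.
line = "<b>bold</b> and <i>italic</i>, call 555-1234 by 2010-02-23"

# Character class with a range, plus {n} repetition.
phone = re.findall(r"\b[0-9]{3}-[0-9]{4}\b", line)  # ['555-1234']

# Pre-defined class: \d is a digit.
date = re.findall(r"\d{4}-\d{2}-\d{2}", line)  # ['2010-02-23']

# Greedy: .* runs to the last ">" on the line.
greedy = re.findall(r"<.*>", line)  # ['<b>bold</b> and <i>italic</i>']
# The extra ? makes it lazy: .*? stops at the first ">".
lazy = re.findall(r"<.*?>", line)  # ['<b>', '</b>', '<i>', '</i>']

# Anchor: ^ pins the match to the start of the string.
starts_with_tag = bool(re.search(r"^<\w+>", line))  # True

# Grouping: (?P<name>...) captures by name, (?:...) groups without capturing.
m = re.search(r"(?P<year>\d{4})-(?:\d{2})-(?P<day>\d{2})", line)
print(m.group("year"), m.group("day"))  # 2010 23
```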
More advanced
- Backreferencing \1 (?P=name)
- (?=...) Lookahead, (?!...) Negative lookahead
- (?<=...) Lookbehind, (?<!...) Negative lookbehind
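A small sketch of backreferences and the four lookaround forms. The sample sentence is made up; the point is that a lookaround checks its context without including it in the match.

```python
import re

text = "it was the the best of times, price: $40 or 35 EUR"  # made-up sample

# Backreference by number: \1 must repeat what group 1 matched (a doubled word).
doubled = re.findall(r"\b(\w+)\s+\1\b", text)  # ['the']

# Same idea with a named group: (?P=word) refers back to (?P<word>...).
named = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", text).group("word")  # 'the'

# Lookahead: digits only if followed by " EUR".
euros = re.findall(r"\d+(?= EUR)", text)  # ['35']

# Negative lookahead: whole numbers not followed by " EUR".
not_euros = re.findall(r"\b\d+\b(?! EUR)", text)  # ['40']

# Lookbehind: digits only if preceded by "$".
dollars = re.findall(r"(?<=\$)\d+", text)  # ['40']

# Negative lookbehind: numbers not preceded by "$".
not_dollars = re.findall(r"(?<!\$)\b\d+", text)  # ['35']
```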
Using Locale specific RegEx
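import locale
import re

# Switch to a French locale (it must be installed on the system).
locale.setlocale(locale.LC_ALL, ("fr", "UTF-8"))
# re.L makes \w and \b follow the current locale. Python 2 only:
# Python 3 accepts re.L only with bytes patterns.
# (\w+) captures the whole word after "je"; (\w)+ would keep only its last letter.
re.compile(r"\bje\s+(\w+)\b", re.L)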
- search
- split
- findall, finditer
- sub (pattern, repl, string, count)
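The functions listed above, side by side on one made-up string (a sketch; the variable names are only for illustration):

```python
import re

text = "one, two;three  four"  # made-up sample

# search: the first match anywhere in the string, or None if there is none.
first = re.search(r"t\w+", text)
print(first.group(0))  # two

# split: cut the string wherever the pattern matches.
parts = re.split(r"[,;\s]+", text)  # ['one', 'two', 'three', 'four']

# findall returns a list of strings.
words = re.findall(r"\w+", text)  # ['one', 'two', 'three', 'four']

# finditer yields match objects, which also know their position.
starts = [m.start() for m in re.finditer(r"\w+", text)]  # [0, 5, 9, 16]

# sub(pattern, repl, string, count): count limits the number of replacements.
once = re.sub(r"[,;]\s*", " / ", text, count=1)  # 'one / two;three  four'
```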
re.escape
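re.escape is useful when the text to search for comes from elsewhere and may contain regex metacharacters. A minimal sketch with a made-up search term:

```python
import re

needle = "3.50"          # hypothetical literal text to look for
text = "3.50 and 3x50"   # made-up sample

# Unescaped, "." matches any character, so "3x50" is found as well.
unescaped = re.findall(needle, text)  # ['3.50', '3x50']

# re.escape backslash-escapes the metacharacters, so only the literal text matches.
escaped = re.findall(re.escape(needle), text)  # ['3.50']
```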
Match objects
- .group(), .group(0) (the whole match, not the first group!)
- .groupdict()
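A short sketch of what a match object gives back. The pattern and the group names "hours" and "minutes" are invented for the example.

```python
import re

m = re.search(r"(?P<hours>\d+):(?P<minutes>\d+)", "starts at 20:33 sharp")

print(m.group())           # 20:33  (no argument means group 0)
print(m.group(0))          # 20:33  (the whole match)
print(m.group(1))          # 20     (the first parenthesized group)
print(m.group("minutes"))  # 33     (a group by name)
print(m.groupdict())       # {'hours': '20', 'minutes': '33'}
```

Note that search returns None when nothing matches, so it is worth checking the result before calling .group on it.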
Usage
"Sniffing" / Searching
.search
i.e. is this an image file?...
(Example of re_find... with os.walk)
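One way such a "sniffing" pass could look, as a sketch only: the helper name find_images and the list of extensions are assumptions made here, not the example used in class.

```python
import os
import re

# Assumption: a file counts as an image if its name ends in one of these
# extensions. re.I makes the test case-insensitive (.JPG as well as .jpg).
IMAGE_PATTERN = re.compile(r"\.(jpe?g|png|gif)$", re.I)

def find_images(top):
    """Yield paths under `top` whose file name looks like an image.

    Hypothetical helper written for this sketch.
    """
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            # search is enough here: we only want to know whether it matches.
            if IMAGE_PATTERN.search(name):
                yield os.path.join(dirpath, name)

if __name__ == "__main__":
    for path in find_images("."):
        print(path)
```

The $ anchor matters: without it a name like "photo.png.bak" would also be reported.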
Automatic markup
Extraction (URL, structured text / mini-markup)
Example "I *blank*"
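Extraction and automatic markup can share one pattern. The sketch below pulls URLs out of a made-up sentence and then wraps each one in a link. The URL pattern is a deliberate simplification (an assumption of this example): it stops at whitespace, angle brackets and parentheses, and would still swallow trailing punctuation such as a full stop.

```python
import re

# Made-up sample sentence using the two links from the top of this page.
text = ("Docs: http://docs.python.org/library/re.html and tool: "
        "http://osteele.com/tools/reanimator/ for visualizing")

# Assumption: a URL is http:// or https:// followed by characters that are
# not whitespace, angle brackets or parentheses.
URL = re.compile(r"https?://[^\s<>()]+")

# Extraction: collect every URL.
urls = URL.findall(text)

# Automatic markup: \g<0> inserts the whole match into the replacement.
linked = URL.sub(r'<a href="\g<0>">\g<0></a>', text)

print(urls)
print(linked)
```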
Splitting
Example: Splitting an SRT with timecode pattern... (Try as simple replacement for nltk.tokenizer...)
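A sketch of the SRT idea, assuming the usual layout of a cue: an index line, a timecode line, the subtitle text, then a blank line. The two cues below are made up.

```python
import re

# Made-up two-cue sample in SRT layout.
srt = """1
00:00:01,000 --> 00:00:04,000
Hello there.

2
00:00:05,500 --> 00:00:07,250
How are you?
"""

# Index line followed by the "start --> end" timecode line.
TIMECODE = r"\d+\n\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d\n"

# split cuts the text at every cue header, leaving the subtitle texts.
pieces = re.split(TIMECODE, srt)

# The first piece is empty (the text starts with a header); drop empties.
cues = [p.strip() for p in pieces if p.strip()]
print(cues)  # ['Hello there.', 'How are you?']
```

Putting parentheses around part of the pattern would make split keep that part in the result list, which is one way to hold on to the timecodes.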
Outside Python
- "Regex Search & Replace" gedit plugin
- "rename" command
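import re
# Load the sample text; the with-block closes the file afterwards.
with open("shelley_frankenstein_trimmed.txt") as f:
    text = f.read()
# text = open("nl.txt").read()

# Print the word following each "I" (case-insensitive):
# for p in re.finditer(r"I (\w+)", text, re.I):
#     # print(p.group(0))
#     print(p.group(1))
# Underline the word after "I" in the first 1000 characters.
print(re.sub(r"I (\w+)", r"I <u>\1</u>", text[:1000]))
# Four-digit numbers, e.g. years.
print(re.findall(r"\b\d\d\d\d\b", text))
# "I" followed by a lowercase word.
print(re.findall(r"\bI [a-z]+\b", text))
# Lowercase words ending in "ly".
print(re.findall(r"\b[a-z]+ly\b", text))
# Underline every "-ly" word; \g<0> is the whole match.
print(re.sub(r"\b[a-z]+ly\b", r"<u>\g<0></u>", text))
# Bold the word after "I"; \g<1> is the first group.
print(re.sub(r"\bI ([a-z]+)\b", r"I <b>\g<1></b>", text))
# Bold pairs of adjacent words that start with the same letter.
print(re.sub(r"\b([a-z])[a-z]+\s\1[a-z]+\b", r"<b>\g<0></b>", text))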