Regular Expressions

The Python documentation of the "re" module: http://docs.python.org/library/re.html

Visualizing regular expressions: http://osteele.com/tools/reanimator/

Syntax

Python raw strings

2 ways to use each function (direct and via a "compiled" regex object)
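
A minimal sketch of both points, using a made-up sample sentence: a raw string leaves backslashes alone, and the same search can be written as a module-level call or as a method on a compiled regex object. The variable names are only illustrative.

```python
import re

# A raw string keeps backslashes as-is, so r"\d" is the two characters \ and d.
assert r"\d" == "\\d"
assert len(r"\n") == 2 and len("\n") == 1

sample = "Frankenstein was published in 1818."  # illustrative text

# 1. Direct: pass the pattern to the module-level function.
direct = re.search(r"\d+", sample)

# 2. Compiled: build a regex object once, then call its methods.
year_re = re.compile(r"\d+")
compiled = year_re.search(sample)

print(direct.group(0))
print(compiled.group(0))
```

The compiled form is mostly a convenience when the same pattern is reused many times.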

Characters
Character Classes ([], -)
\ for literals
\w \d \s (\W \S \D) Pre-defined char classes
Manyness (+*?{})
Limiting "greediness" with an extra ?
Anchors (^ $ \b)
Grouping (), (?:) (?P<name>)
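
The syntax items above can be tried out on one short string. This is only a sketch; the sample text and variable names are invented for illustration.

```python
import re

text = "The <b>creature</b> fled on 11 March 1797, <b>alone</b>."  # made-up sample

# Character class with a range, plus a {m} quantifier: 4-digit numbers.
years = re.findall(r"\b[0-9]{4}\b", text)

# Pre-defined classes: \d digits, \s whitespace, \w word characters.
day_month = re.findall(r"\d+\s\w+", text)

# Greedy .* runs to the last </b>; the extra ? stops at the first one.
greedy = re.findall(r"<b>.*</b>", text)
lazy = re.findall(r"<b>.*?</b>", text)

# Anchors: ^ matches at the start, \b at a word boundary.
starts_with_the = re.search(r"^The\b", text) is not None

# Groups: (?:...) groups without capturing, (?P<name>...) captures by name.
m = re.search(r"(?:on|in)\s(?P<day>\d+)\s(?P<month>\w+)", text)

print(years, day_month, lazy)
print(m.group("day"), m.group("month"))
```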

More advanced

Backreferencing \1 (?P=name)
(?=...) Lookahead, (?!...) Negative lookahead
(?<=...) Lookbehind, (?<!...) Negative lookbehind
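
A small sketch of backreferences and the four lookaround forms, again on an invented sentence. Lookarounds check the surrounding text without including it in the match.

```python
import re

text = "It was was a dreary night; price: 40 francs, 12 sous, room 7."  # made-up

# \1 refers back to whatever group 1 matched: find a doubled word.
doubled = re.search(r"\b(\w+)\s+\1\b", text)

# The same with a named group and (?P=name).
doubled_named = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", text)

# Lookahead: numbers followed by " francs", without consuming " francs".
francs = re.findall(r"\d+(?= francs)", text)

# Negative lookahead: whole numbers not followed by " francs".
not_francs = re.findall(r"\b\d+\b(?! francs)", text)

# Lookbehind: a number preceded by "room ".
room = re.findall(r"(?<=room )\d+", text)

# Negative lookbehind: whole numbers not preceded by "room ".
not_room = re.findall(r"(?<!room )\b\d+\b", text)

print(doubled.group(0), francs, not_francs, room, not_room)
```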

Using Locale specific RegEx

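import locale
locale.setlocale(locale.LC_ALL, ("fr", "UTF-8"))
#
import re
# (\w+) captures the whole word; (\w)+ would keep only its last character.
# Python 3 accepts re.L only with bytes patterns, hence the b prefix.
re.compile(br"\bje\s+(\w+)\b", re.L)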

search
split
findall, finditer
sub (pattern, repl, string, count)
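
The functions listed above, sketched side by side on one invented string so the different return types are easy to compare.

```python
import re

line = "one, two;three ,  four"  # made-up sample

# search: first match anywhere in the string, or None.
first = re.search(r"t\w+", line)

# split: cut the string wherever the pattern matches.
parts = re.split(r"\s*[,;]\s*", line)

# findall: all non-overlapping matches as a list of strings.
words = re.findall(r"\w+", line)

# finditer: the same matches, but as match objects (with positions).
starts = [m.start() for m in re.finditer(r"\w+", line)]

# sub(pattern, repl, string, count): replace at most `count` matches.
once = re.sub(r"[,;]", " /", line, count=1)

print(first.group(0), parts, words, starts)
print(once)
```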

re.escape
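
A short sketch of why re.escape is useful: a literal string containing regex metacharacters will not reliably match itself until it is escaped. The filename is invented for illustration.

```python
import re

filename = "notes(v1.0).txt"  # made-up literal containing ( ) and .

# Unescaped, "(...)" forms a group and "." matches any character,
# so the literal filename does not even match itself.
assert re.search(filename, filename) is None

# re.escape backslash-escapes the special characters.
escaped = re.escape(filename)
pattern = re.compile(escaped)

found = pattern.search("see notes(v1.0).txt") is not None
not_found = pattern.search("see notes(v1x0).txt") is None
print(escaped, found, not_found)
```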

Match objects

.group, .group(0)!
.groupdict()
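
A sketch of the main ways to read a match object, using an invented sentence and group names. Note that search returns None when nothing matches, so real code should check before calling .group.

```python
import re

text = "Lecture starts at 09:30 sharp"  # made-up sample
m = re.search(r"(?P<hour>\d\d):(?P<minute>\d\d)", text)

whole = m.group(0)          # the whole match; same as m.group()
hour = m.group(1)           # first group, by number
minute = m.group("minute")  # a named group, by name
both = m.groups()           # all groups as a tuple
named = m.groupdict()       # named groups as a dict
start, end = m.span()       # offsets of the whole match in text

print(whole, hour, minute, both, named)
print(text[start:end])
```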

Usage

"Sniffing" / Searching

.search

   i.e. is this an image file?...

(Example of re_find... with os.walk)
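
One possible shape for such an example, as a sketch only: the function name find_images and the list of extensions are assumptions made here, not taken from the original example.

```python
import os
import re

# Assumed list of image extensions; extend as needed. re.I ignores case.
IMAGE_RE = re.compile(r"\.(?:jpe?g|png|gif)$", re.I)


def find_images(top):
    """Yield paths under `top` whose filename looks like an image."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if IMAGE_RE.search(name):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    for path in find_images("."):
        print(path)
```

This only "sniffs" the filename; it says nothing about what the file actually contains.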

Automatic markup

Extraction (URL, structured text / mini-markup)

 Example "I *blank*"
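
A sketch of extraction and automatic markup together. The URL pattern is deliberately simple and is an assumption made for this example, not a complete URL grammar; the sample sentences are invented.

```python
import re

text = ("Docs: http://docs.python.org/library/re.html and "
        "http://osteele.com/tools/reanimator/ (see both).")

# "http://" or "https://" followed by anything that is not whitespace,
# an angle bracket, a double quote or a parenthesis.
URL_RE = re.compile(r"https?://[^\s<>\"()]+")

# Extraction: pull out every URL.
urls = URL_RE.findall(text)

# Automatic markup: wrap each URL in a link, reusing the match via \g<0>.
linked = URL_RE.sub(r'<a href="\g<0>">\g<0></a>', text)

# "I *blank*": extract the word that follows each "I".
blanks = re.findall(r"\bI (\w+)", "I saw the creature. I fled, and I wept.")

print(urls)
print(linked)
print(blanks)
```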

Splitting

   Example: Splitting an SRT with timecode pattern...
   (Try as a simple replacement for nltk.tokenize...)
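
A sketch of both ideas on a tiny made-up SRT fragment. The timecode pattern assumes the common "HH:MM:SS,mmm --> HH:MM:SS,mmm" layout preceded by a counter line; the tokenizer is a crude stand-in, not equivalent to what nltk does.

```python
import re

# A tiny invented SRT fragment.
srt = """1
00:00:01,000 --> 00:00:03,500
It was on a dreary night

2
00:00:04,000 --> 00:00:06,000
that I beheld my creature.
"""

# Counter line, then the timecode line. The capture groups make re.split
# keep the start and end times in the result.
TIMECODE_RE = re.compile(
    r"\d+\n(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)\n"
)

pieces = TIMECODE_RE.split(srt)
# pieces[0] is the text before the first cue (empty here); after that the
# list repeats: start, end, text.
cues = [
    (pieces[i], pieces[i + 1], pieces[i + 2].strip())
    for i in range(1, len(pieces), 3)
]

# Crude tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", "that I beheld my creature.")

print(cues)
print(tokens)
```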

Outside Python

"Regex Search & Replace" gedit plugin
"rename" command

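import re

# Read the whole text into one string (the file is closed on leaving the block).
with open("shelley_frankenstein_trimmed.txt") as f:
    text = f.read()
# text = open("nl.txt").read()

# List the word following every "I" (re.I makes the match case-insensitive).
#for p in re.finditer(r"I (\w+)", text, re.I):
#    # print(p.group(0))
#    print(p.group(1))

# Underline the word after each "I " in the first 1000 characters.
print(re.sub(r"I (\w+)", r"I <u>\1</u>", text[:1000]))

# Four-digit numbers (e.g. years).
print(re.findall(r"\b\d\d\d\d\b", text))

# "I" followed by a lowercase word.
print(re.findall(r"\bI [a-z]+\b", text))

# Lowercase words ending in "ly".
print(re.findall(r"\b[a-z]+ly\b", text))

# The same words, underlined; \g<0> is the whole match.
print(re.sub(r"\b[a-z]+ly\b", r"<u>\g<0></u>", text))

# Bold the word after "I"; \g<1> is the first group.
print(re.sub(r"\bI ([a-z]+)\b", r"I <b>\g<1></b>", text))

# Bold pairs of adjacent words that start with the same letter.
print(re.sub(r"\b([a-z])[a-z]+\s\1[a-z]+\b", r"<b>\g<0></b>", text))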

Syllabus 20100223
