Syllabus 20100223

From XPUB & Lens-Based wiki

Regular Expressions

The Python documentation of the "re" module: http://docs.python.org/library/re.html

Visualizing regular expressions: http://osteele.com/tools/reanimator/

(Image: chutesladders.gif)

Syntax

Python raw strings
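
A small sketch of why patterns are usually written as raw strings. The sample text is made up for illustration.

```python
import re

# In a normal string "\b" is a single backspace character.
plain = "\bcat\b"
# In a raw string the backslash survives, so re sees the \b word boundary.
raw = r"\bcat\b"

text = "a cat sat"
plain_hit = re.search(plain, text)  # looks for backspace characters
raw_hit = re.search(raw, text)      # looks for the word "cat"

print(len(plain))  # 5 characters
print(len(raw))    # 7 characters
print(plain_hit)   # None
print(raw_hit.group(0))
```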

2 ways to use each function (direct and via a "compiled" regex object)
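
The two ways side by side, as a minimal sketch. The sample text and the name "number" are only for illustration.

```python
import re

text = "Call 2010 or 1984"

# Direct: hand the pattern to the module-level function on every call.
direct = re.findall(r"\d+", text)

# Compiled: build a pattern object once, then call its methods.
number = re.compile(r"\d+")
compiled = number.findall(text)

print(direct)
print(compiled)
```

Both calls return the same list. The compiled form is mostly a convenience when the same pattern is reused many times.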

  • Characters
  • Character Classes ([], -)
  • \ for literals
  • \w \d \s (\W \S \D) Pre-defined char classes
  • Manyness (+*?{})
  • Limiting "greediness" with an extra ?
  • Anchors (^ $ \b)
  • Grouping (), (?:) (?P<name>)
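
The items above can be tried out on one made-up sentence. This is only a sketch. The sentence and variable names are invented for illustration.

```python
import re

s = "The cat sat on the mat at 10:45, <b>bold</b> and <i>italic</i>."

# Character class with a range: any lowercase letter followed by "at".
three = re.findall(r"\b[a-z]at\b", s)

# Backslash turns the special "." into a literal dot.
dots = re.findall(r"\.", s)

# Pre-defined class plus manyness: one or more digits, exactly four word chars.
digits = re.findall(r"\d+", s)
four = re.findall(r"\b\w{4}\b", s)

# Greedy + takes as much as it can; the extra ? makes it stop early.
greedy = re.findall(r"<.+>", s)
lazy = re.findall(r"<.+?>", s)

# Anchor: ^ only matches at the start of the string.
first_word = re.findall(r"^\w+", s)

# Grouping: named groups, and a non-capturing group for alternatives.
time = re.search(r"(?P<hour>\d+):(?P<minute>\d+)", s)
noncap = re.findall(r"(?:c|m)at", s)

print(three)
print(greedy)
print(lazy)
print(time.group("hour"))
```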

More advanced

  • Backreferencing \1 (?P=name)
  • (?=...) Lookahead, (?!...) Negative lookahead
  • (?<=...) Lookbehind, (?<!...) Negative lookbehind
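
A sketch of backreferences and the four lookaround forms, again on an invented sentence.

```python
import re

s = "it was was a price of $40 and 15 euro, not not 99"

# Backreference: \1 must repeat exactly what group 1 matched (doubled words).
doubled = re.findall(r"\b(\w+) \1\b", s)

# The same idea with a named group and (?P=name).
named = re.search(r"\b(?P<w>\w+) (?P=w)\b", s)

# Lookahead: digits followed by " euro" (the " euro" is not part of the match).
euro = re.findall(r"\d+(?= euro)", s)

# Negative lookahead: whole numbers NOT followed by " euro".
not_euro = re.findall(r"\b\d+\b(?! euro)", s)

# Lookbehind: digits preceded by a dollar sign.
dollar = re.findall(r"(?<=\$)\d+", s)

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
not_dollar = re.findall(r"(?<!\$)\b\d+", s)

print(doubled)
print(euro)
print(dollar)
```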


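Using locale-specific RegEx

import locale
import re

# Assumes a French locale is installed on the system.
locale.setlocale(locale.LC_ALL, ("fr", "UTF-8"))
# re.L (re.LOCALE) makes \w and \b follow the current locale. Python 3
# accepts it only with bytes patterns; str patterns are Unicode-aware.
# (\w+) captures the whole word; (\w)+ would keep only its last letter.
re.compile(r"\bje\s+(\w+)\b", re.L)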

Functions

  • search
  • split
  • findall, finditer
  • sub (pattern, repl, string, count)
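
A quick sketch of each function on one invented string, to show what kind of value comes back.

```python
import re

s = "one, two;three  four"

# search: the first match only, as a match object (or None).
first = re.search(r"t\w+", s)

# split: cut the string wherever the pattern matches.
parts = re.split(r"[,;\s]+", s)

# findall: a list of all matching strings.
words = re.findall(r"\w+", s)

# finditer: match objects one by one, so positions are available too.
spans = [(m.group(0), m.start()) for m in re.finditer(r"t\w+", s)]

# sub: replace matches; count limits how many are replaced.
once = re.sub(r"t", "T", s, count=1)
tidy = re.sub(r"\s+", " ", s)

print(first.group(0))
print(parts)
print(spans)
print(once)
```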

re.escape
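
re.escape is useful when the text to search for may contain characters that are special in a pattern. A minimal sketch with made-up values:

```python
import re

needle = "3.50"
text = "3150 or 3.50"

# Unescaped, the "." matches any character, so "3150" is found as well.
loose = re.findall(needle, text)

# Escaped, the "." only matches a literal dot.
exact = re.findall(re.escape(needle), text)

print(loose)
print(exact)
```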

Match objects

  • .group(n), and .group(0) for the whole match
  • .groupdict()
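
What a match object gives back, sketched on a sample string chosen for illustration:

```python
import re

m = re.search(r"(?P<first>\w+) (?P<last>\w+)", "Mary Shelley, 1818")

whole = m.group(0)        # the whole match
first = m.group(1)        # group 1, by number
last = m.group("last")    # a named group, by name
named = m.groupdict()     # all named groups as a dictionary

# search returns None when nothing matches, so test before calling .group
missing = re.search(r"\d{5}", "Mary Shelley, 1818")

print(whole)
print(named)
print(missing)
```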


Usage

"Sniffing" / Searching

.search

   i.e. is this an image file?...

(Example of re_find... with os.walk)
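
One possible sketch of "sniffing" filenames while walking a directory tree. The function name find_images and the list of extensions are assumptions made for this example, not the code used in class.

```python
import os
import re

# Assumed set of image extensions; re.I ignores upper/lower case.
IMAGE_PAT = re.compile(r"\.(jpe?g|png|gif)$", re.I)


def find_images(root):
    """Return the paths under root whose filename looks like an image."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # search (not match), because the pattern sits at the end
            if IMAGE_PAT.search(name):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling find_images(".") would list the image files below the current directory. This only looks at the name, not at the file contents.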

Automatic markup

Extraction (URL, structured text / mini-markup)

 Example "I *blank*"
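
A small extraction sketch. The URL pattern is deliberately naive (anything up to the next whitespace) and the sample text is invented. Real URLs followed by punctuation would need more care.

```python
import re

text = ("I saw http://docs.python.org/library/re.html today. "
        "I think I like http://osteele.com/tools/reanimator/ more")

# Naive URL pattern: http or https, then everything up to whitespace.
urls = re.findall(r"https?://\S+", text)

# "I *blank*": with one group, findall returns just the group.
after_i = re.findall(r"\bI (\w+)", text)

print(urls)
print(after_i)
```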

Splitting

   Example: Splitting an SRT with timecode pattern...
   (Try as a simple replacement for nltk.tokenize...)
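
A sketch of both ideas. The SRT fragment and its timecodes are invented sample data, and the token pattern is only a rough stand-in for a real tokenizer.

```python
import re

srt = """1
00:00:01,000 --> 00:00:04,000
It was on a dreary night of November

2
00:00:04,500 --> 00:00:08,000
that I beheld
the accomplishment of my toils.
"""

TIMECODE = r"\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d"

# Split on "number, newline, timecode, newline". The parentheses around
# the timecode make split keep it in the result list.
pieces = re.split(r"\d+\n(" + TIMECODE + r")\n", srt)

# pieces[0] is the (empty) text before the first title; after that the
# list alternates timecode, text, timecode, text...
subtitles = [(pieces[i], pieces[i + 1].strip())
             for i in range(1, len(pieces), 2)]

# Rough tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", "Hello, world!")

print(subtitles)
print(tokens)
```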

Outside Python

  • "Regex Search & Replace" gedit plugin
  • "rename" command
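
import re

# Read the whole text into a single string.
with open("shelley_frankenstein_trimmed.txt") as f:
    text = f.read()
# text = open("nl.txt").read()

# Print the word that follows each "I" (re.I also matches lowercase "i").
#for p in re.finditer(r"I (\w+)", text, re.I):
#    # print(p.group(0))
#    print(p.group(1))

# Underline the word after "I" in the first 1000 characters.
print(re.sub(r"I (\w+)", r"I <u>\1</u>", text[:1000]))

# Find four-digit numbers, such as years.
print(re.findall(r"\b\d\d\d\d\b", text))

# Find "I" followed by a lowercase word.
print(re.findall(r"\bI [a-z]+\b", text))

# Find lowercase words ending in "ly".
print(re.findall(r"\b[a-z]+ly\b", text))

# Underline them; \g<0> stands for the whole match.
print(re.sub(r"\b[a-z]+ly\b", r"<u>\g<0></u>", text))

# Bold the word after "I"; \g<1> is group 1, the same as \1.
print(re.sub(r"\bI ([a-z]+)\b", r"I <b>\g<1></b>", text))

# Bold pairs of neighbouring words that start with the same letter.
print(re.sub(r"\b([a-z])[a-z]+\s\1[a-z]+\b", r"<b>\g<0></b>", text))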