Regular expressions

From XPUB & Lens-Based wiki

Load some text from a file

Imagine you have some text, say from a text file:

import re

text = open("pg105.txt").read()

Finding a pattern with .findall

re.findall returns different things depending on how many pairs of parentheses () (capturing groups) there are in your pattern.

If there are no parentheses, it returns the complete text of each match, in the order the matches are found in the text:

for match in re.findall(r"the \w+", text):
    print(match)
   the use
   the terms
   the Project
   the Baronetage
   the limited
   the earliest
   the almost
   the last
   the page
   the favourite

If there's one pair of parentheses, only the text inside the parentheses is returned:

for match in re.findall(r"the (\w+)", text):
    print(match)
   use
   terms
   Project
   Baronetage
   limited
   earliest
   almost
   last
   page
   favourite

Finally, if there are multiple pairs of parentheses, findall returns one tuple per match, with one item for each pair of parentheses:

for match in re.findall(r"(\w+) the (\w+)", text):
    print(match)
   ('for', 'use')
   ('under', 'terms')
   ('of', 'Project')
   ('but', 'Baronetage')
   ('contemplating', 'limited')
   ('of', 'earliest')
   ('over', 'almost')
   ('of', 'last')
   ('was', 'page')
   ('which', 'favourite')
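
The three cases can be compared side by side on a short string. This is only a minimal sketch: the `sample` sentence is made up for illustration and is not taken from pg105.txt.

```python
import re

# Made-up sample text (an assumption for illustration, not from pg105.txt)
sample = "under the terms of the Project"

# No parentheses: a list of whole matches
whole = re.findall(r"the \w+", sample)
print(whole)    # ['the terms', 'the Project']

# One pair of parentheses: only the captured text
inner = re.findall(r"the (\w+)", sample)
print(inner)    # ['terms', 'Project']

# Two pairs of parentheses: one tuple per match
pairs = re.findall(r"(\w+) the (\w+)", sample)
print(pairs)    # [('under', 'terms'), ('of', 'Project')]
```

In every case findall returns a list, so the results can be counted with len() or looped over as in the examples above.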

Search & Replace with .sub

print(re.sub(r"the (\w+)", r"the ONLY \1", text))
   This eBook is for the ONLY use of anyone anywhere at no cost and with
   almost no restrictions whatsoever.  You may copy it, give it away or
   re-use it under the ONLY terms of the ONLY Project Gutenberg License included
   with this eBook or online at www.gutenberg.net
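
To see the backreference at work on something small, here is a hedged sketch using a made-up sentence (`sample` is an assumption, not text from the eBook). `\1` in the replacement stands for whatever the first pair of parentheses matched, and the optional `count` argument of re.sub limits how many replacements are made.

```python
import re

# Made-up sample sentence (an assumption for illustration)
sample = "the cat saw the dog"

# \1 is replaced by the text captured by the first (and only) group
everywhere = re.sub(r"the (\w+)", r"the ONLY \1", sample)
print(everywhere)   # the ONLY cat saw the ONLY dog

# count=1 replaces only the first match
first_only = re.sub(r"the (\w+)", r"the ONLY \1", sample, count=1)
print(first_only)   # the ONLY cat saw the dog
```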

References

This script (multisub.py) reads search/replace pairs from a file and applies them to every line of its standard input.

from argparse import ArgumentParser
import re, sys

parser = ArgumentParser()
parser.add_argument("subs", help="file containing substitutions, one per line, split by one or more tabs")
parser.add_argument("--case", default=False, action="store_true", help="case sensitive")
parser.add_argument("--no-boundary", default=False, action="store_true", help="don't wrap search patterns in word boundaries")
args = parser.parse_args()

# Read the substitution patterns, skipping blank lines and # comments
with open(args.subs, encoding="utf-8") as f:
    lines = [x.strip() for x in f if x.strip() and not x.startswith("#")]
    subs = [[part.strip() for part in re.split(r"\t+", x, maxsplit=1)] for x in lines]

# Match case-insensitively unless --case is given
flags = 0 if args.case else re.I

# Apply every substitution, in order, to each line of standard input
for line in sys.stdin:
    for search, replace in subs:
        if not args.no_boundary:
            # \b on both sides restricts the match to whole words
            search = r"\b{0}\b".format(search)
        line = re.sub(search, replace, line, flags=flags)
    sys.stdout.write(line)
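
As a rough sketch of what the script does to a single line, the core loop can be pulled out into a function. The name `apply_subs`, its `case` and `boundary` parameters, and the sample substitutions are all hypothetical; they are not part of multisub.py.

```python
import re

def apply_subs(line, subs, case=False, boundary=True):
    """Apply each (search, replace) pair to line, in order (hypothetical helper)."""
    flags = 0 if case else re.I
    for search, replace in subs:
        if boundary:
            # \b on both sides restricts the match to whole words
            search = r"\b{0}\b".format(search)
        line = re.sub(search, replace, line, flags=flags)
    return line

# Two made-up substitution lines, split on one or more tabs as the script does
raw = ["cat\tdog", "the\t\tTHE"]
subs = [[part.strip() for part in re.split(r"\t+", x, maxsplit=1)] for x in raw]

result = apply_subs("The cat sat on the cathedral.", subs)
print(result)         # THE dog sat on THE cathedral.

# Without word boundaries, "cat" is also replaced inside "cathedral"
no_boundary = apply_subs("The cat sat on the cathedral.", subs, boundary=False)
print(no_boundary)    # THE dog sat on THE doghedral.
```

The script itself reads from standard input and writes to standard output, so it would typically be run along the lines of `python multisub.py subs.txt < input.txt > output.txt`, where subs.txt and input.txt are placeholder file names.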