Regular expressions
Latest revision as of 16:22, 7 April 2015
Load some text from a file
Imagine you have some text, say from a text file:
text = open("pg105.txt").read()
Finding a pattern with .findall
re.findall returns different things depending on how many capturing groups, that is pairs of parentheses (), the pattern contains.
If there are no parentheses, it returns a list with the complete text of each match, in the order the matches occur in the text:
import re

for match in re.findall(r"the \w+", text):
    print(match)
the use
the terms
the Project
the Baronetage
the limited
the earliest
the almost
the last
the page
the favourite
If there's one pair of parentheses, only the text inside the parentheses is returned:
for match in re.findall(r"the (\w+)", text):
    print(match)
use
terms
Project
Baronetage
limited
earliest
almost
last
page
favourite
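Parentheses are also needed for alternation, even when nothing is meant to be captured. The following is a minimal sketch, using a made-up sample sentence rather than pg105.txt, of how a non-capturing group (?:...) keeps findall returning the whole match:

```python
import re

# Made-up sample text (an assumption for illustration, not taken from pg105.txt)
sample = "She read the page and a letter, then the last page again."

# Capturing group: findall returns only the text inside the parentheses
captured = re.findall(r"\b(the|a) \w+", sample)

# Non-capturing group (?:...): groups without capturing, so the full match is returned
whole = re.findall(r"\b(?:the|a) \w+", sample)

print(captured)  # ['the', 'a', 'the']
print(whole)     # ['the page', 'a letter', 'the last']
```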
Finally, if there are multiple pairs of parentheses, findall returns one tuple per match, holding the text captured by each pair:
for match in re.findall(r"(\w+) the (\w+)", text):
    print(match)
('for', 'use')
('under', 'terms')
('of', 'Project')
('but', 'Baronetage')
('contemplating', 'limited')
('of', 'earliest')
('over', 'almost')
('of', 'last')
('was', 'page')
('which', 'favourite')
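The three cases can be compared side by side on a short string. This is only a sketch: the sample sentence is invented for illustration, and re.finditer is shown as one possible way to get both the whole match and its groups when findall's return shape is inconvenient.

```python
import re

# Made-up sample text (an assumption, standing in for pg105.txt)
sample = "It was for the use of anyone, under the terms of the licence."

no_groups = re.findall(r"the \w+", sample)            # list of whole matches
one_group = re.findall(r"the (\w+)", sample)          # list of the group's text
two_groups = re.findall(r"(\w+) the (\w+)", sample)   # list of tuples, one per match

# finditer yields match objects: group(0) is the whole match, groups() the captures
pairs = [(m.group(0), m.groups()) for m in re.finditer(r"(\w+) the (\w+)", sample)]

print(no_groups)   # ['the use', 'the terms', 'the licence']
print(one_group)   # ['use', 'terms', 'licence']
print(two_groups)  # [('for', 'use'), ('under', 'terms'), ('of', 'licence')]
print(pairs[0])    # ('for the use', ('for', 'use'))
```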
Search & Replace with .sub
print(re.sub(r"the (\w+)", r"the ONLY \1", text))
This eBook is for the ONLY use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the ONLY terms of the ONLY Project Gutenberg License included
with this eBook or online at www.gutenberg.net
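In the replacement string, \1 stands for whatever the first pair of parentheses captured. The sketch below repeats that on a short invented string and also shows two standard features of re.sub that the example above does not use: a function as the replacement, and the count argument. The helper name shout is hypothetical.

```python
import re

# Made-up sample text (an assumption for illustration)
sample = "re-use it under the terms of the Project Gutenberg License"

# \1 in the replacement refers to the text captured by the first group
marked = re.sub(r"the (\w+)", r"the ONLY \1", sample)

# The replacement can also be a function that receives each match object
def shout(match):
    return "the " + match.group(1).upper()

shouted = re.sub(r"the (\w+)", shout, sample)

# count limits how many matches are replaced
first_only = re.sub(r"the (\w+)", r"the ONLY \1", sample, count=1)

print(marked)      # re-use it under the ONLY terms of the ONLY Project Gutenberg License
print(shouted)     # re-use it under the TERMS of the PROJECT Gutenberg License
print(first_only)  # re-use it under the ONLY terms of the Project Gutenberg License
```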
References
- A cheat sheet: https://gist.github.com/ccstone/5385334
This script (multisub.py) reads a file of substitutions and applies each of them, in order, to every line of standard input.
from argparse import ArgumentParser
import re
import sys

p = ArgumentParser()
p.add_argument("subs", default=None, help="file containing substitutions, one per line, split by one or more tabs")
p.add_argument("--case", default=False, action="store_true", help="case sensitive")
p.add_argument("--no-boundary", default=False, action="store_true", help="don't pad match with word boundaries")
args = p.parse_args()

# Read the substitution patterns, skipping blank lines and lines starting with #
with open(args.subs, encoding="utf-8") as f:
    lines = [x.strip() for x in f if x.strip() and not x.startswith("#")]
# Each remaining line is "pattern<TAB>replacement"; split on the first run of tabs
subs = [[part.strip() for part in re.split(r"\t+", x, maxsplit=1)] for x in lines]

# Match case-insensitively unless --case was given
flags = 0 if args.case else re.I

# In Python 3, sys.stdin and sys.stdout already handle text, so no decode/encode is needed
for line in sys.stdin:
    for search, replace in subs:
        if not args.no_boundary:
            # Only match whole words
            search = r"\b{0}\b".format(search)
        line = re.sub(search, replace, line, flags=flags)
    sys.stdout.write(line)
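To try the substitution logic without a file or standard input, the two steps of the script can be pulled into functions. This is a sketch rather than part of multisub.py: the names parse_rules and apply_rules and the sample rules are assumptions made for illustration, and unlike the script it silently ignores lines that contain no tab.

```python
import re

def parse_rules(text):
    """Parse 'pattern<TAB>replacement' lines, skipping blank lines and # comments."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = [part.strip() for part in re.split(r"\t+", line, maxsplit=1)]
        if len(parts) == 2:  # ignore lines without a tab-separated replacement
            rules.append(tuple(parts))
    return rules

def apply_rules(line, rules, case=False, boundary=True):
    """Apply each (search, replace) rule in order, mirroring the script's inner loop."""
    flags = 0 if case else re.I
    for search, replace in rules:
        if boundary:
            search = r"\b{0}\b".format(search)
        line = re.sub(search, replace, line, flags=flags)
    return line

# Hypothetical contents of a substitutions file
rules_text = "# British to American spelling\ncolou?r\tcolor\nfavourite\t\tfavorite\n"
rules = parse_rules(rules_text)

result = apply_rules("Her Favourite colour was the favourite.", rules)
result_case = apply_rules("Her Favourite colour was the favourite.", rules, case=True)

print(rules)        # [('colou?r', 'color'), ('favourite', 'favorite')]
print(result)       # Her favorite color was the favorite.
print(result_case)  # Her Favourite color was the favorite.
```

Because the rules are applied in file order, a later rule sees the output of an earlier one, so the order of lines in the substitutions file can matter.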