Diffs in Python
Typically, diff is used to show the changes between two versions of the same file.
The output of similar file comparison utilities is also called a "diff".
From the Wikipedia article on the diff utility.
diffs between strings
I want to detect the differences between two strings: both their position and their content.
There seem to be two ways of doing it in Python: a rudimentary one using the != operator, and another using difflib.
a = "Hello dog"
b = "Hello god"
# compare the two strings position by position
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])
6 d g
8 g d
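This method has the drawback of only working for strings of the same length: if b is shorter than a, b[i] raises an IndexError, and if b is longer, its extra characters are never looked at.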
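One possible way around the equal-length restriction is to pad the shorter string while looping. The sketch below is my own assumption of how that could look, using itertools.zip_longest; the helper name positional_diff is hypothetical.

```python
from itertools import zip_longest


def positional_diff(a, b):
    """Return (index, char_in_a, char_in_b) for every position that differs.

    Positions past the end of the shorter string are reported with None.
    """
    return [
        (i, ca, cb)
        for i, (ca, cb) in enumerate(zip_longest(a, b))
        if ca != cb
    ]


for i, ca, cb in positional_diff("Hello dog", "Hello god!"):
    print(i, ca, cb)
```

This is still a position-by-position comparison, so a single character inserted near the start shifts everything after it and every later position is reported as different. That is the case difflib handles better.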
difflib is a module in the Python standard library dedicated to comparing sequences.
difflib.SequenceMatcher is a class dedicated to comparing pairs of sequences.
import difflib
a = "Hello dog"
b = "Hello god"
s = difflib.SequenceMatcher(None, a, b)
# each block describes a run of characters that is identical in a and b
for block in s.get_matching_blocks():
    print(block)
Match(a=0, b=0, size=6)
Match(a=6, b=8, size=1)
Match(a=9, b=9, size=0)
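Each Match(a, b, size) means that a[a:a+size] is equal to b[b:b+size]; the last block always has size 0 and only marks the end of the list. As a small sketch of how to read them, the following turns the blocks back into the matched text. The helper name matched_substrings is hypothetical.

```python
import difflib


def matched_substrings(a, b):
    """Return the text of every non-empty matching block between a and b."""
    s = difflib.SequenceMatcher(None, a, b)
    return [
        a[m.a:m.a + m.size]
        for m in s.get_matching_blocks()
        if m.size > 0  # skip the zero-length block that ends the list
    ]


print(matched_substrings("Hello dog", "Hello god"))
```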
difflib.Differ
The Differ class compares sequences of lines of text and produces human-readable differences, or deltas. Each line of the output begins with a two-character code:
'- ' line unique to sequence 1
'+ ' line unique to sequence 2
'  ' line common to both sequences
'? ' line not present in either input sequence
When given two plain strings, as in the example below, each character is treated as a line.
import difflib
a = "Hello dog"
b = "Hello god"
d = difflib.Differ()
# a and b are strings, so every character is compared as if it were a line
diff = d.compare(a, b)
for i in diff:
    print(i)
  H
  e
  l
  l
  o
   
+ g
+ o
  d
- o
- g
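Since Differ is meant for lines rather than characters, here is a hedged sketch of the same comparison on two small multi-line texts. The texts are invented for this example, and the '? ' hint lines are filtered out to keep the result short.

```python
import difflib

# Two small multi-line texts, invented for this example
old_text = "Hello dog\nHow are you?\n"
new_text = "Hello god\nHow are you?\n"

d = difflib.Differ()
# keepends=True keeps the trailing newlines on each line
result = list(d.compare(old_text.splitlines(keepends=True),
                        new_text.splitlines(keepends=True)))

# Drop the '? ' hint lines to keep only removed, added and common lines
changes = [line for line in result if not line.startswith('? ')]
print(''.join(changes), end='')
```

Here the first line shows up once with '- ' and once with '+ ', while the unchanged second line keeps the two-space prefix.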
get_opcodes()
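SequenceMatcher.get_opcodes() returns a list of 5-tuples (tag, i1, i2, j1, j2) describing how to turn a into b. The tag is one of 'replace', 'delete', 'insert' or 'equal', and it applies to the slices a[i1:i2] and b[j1:j2].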
import difflib
a = "Hello dog"
b = "Hello god"
s = difflib.SequenceMatcher(None, a, b)
# print each opcode together with the slices of a and b it refers to
for tag, i1, i2, j1, j2 in s.get_opcodes():
    print("{tag} a[{i1}:{i2}] ({a_str}) b[{j1}:{j2}] ({b_str})".format(
        tag=tag, i1=i1, i2=i2, a_str=a[i1:i2],
        j1=j1, j2=j2, b_str=b[j1:j2]))
equal a[0:6] (Hello ) b[0:6] (Hello )
insert a[6:6] () b[6:8] (go)
equal a[6:7] (d) b[8:9] (d)
delete a[7:9] (og) b[9:9] ()
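To check that the opcodes really describe how to turn a into b, one can walk them and rebuild b. This is only a sketch; the helper name apply_opcodes is hypothetical.

```python
import difflib


def apply_opcodes(a, b):
    """Rebuild b by walking the opcodes that turn a into b."""
    s = difflib.SequenceMatcher(None, a, b)
    pieces = []
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == 'equal':
            pieces.append(a[i1:i2])   # unchanged: copy from a
        elif tag in ('insert', 'replace'):
            pieces.append(b[j1:j2])   # new text comes from b
        # 'delete': a[i1:i2] is dropped, so nothing is appended
    return ''.join(pieces)


print(apply_opcodes("Hello dog", "Hello god"))
```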
doing things with diffs
reconstruct and mark
I want to use the opcodes to mark, in the most recent string, the changes that were made to arrive at it.
#!/usr/bin/env python
import difflib

# reconstruct b, with marks indicating the changes relative to a
a = "Hello dog"
b = "Hello god, finally"
s = difflib.SequenceMatcher(None, a, b)
x = ''
for tag, i1, i2, j1, j2 in s.get_opcodes():
    b_str = b[j1:j2]
    # compare tags with ==, not "is": identity of equal strings is not guaranteed
    if tag == "insert":
        # text that exists only in b: wrap it in **
        x = x + "**" + b_str + "**"
    elif tag == "replace":
        # text in b that takes the place of a[i1:i2]: wrap it in _
        x = x + "_" + b_str + "_"
    elif tag == "equal":
        # unchanged text is copied as is
        x = x + b_str
    elif tag == "delete":
        # for a delete, b[j1:j2] is always empty, so nothing is added here
        if len(b_str) > 0:
            x = x + "-" + b_str + "-"
print(x)
Hello **go**d_, finally_
This tells us that:
- "go" was inserted
- ", finally" is a replacement (it takes the place of "og")
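The same summary can also be collected as data instead of marks in the string. A minimal sketch, assuming we only care about the opcodes that are not 'equal'; the helper name list_changes is hypothetical.

```python
import difflib


def list_changes(a, b):
    """Return (tag, old_text, new_text) for every opcode that is not 'equal'."""
    s = difflib.SequenceMatcher(None, a, b)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in s.get_opcodes()
        if tag != 'equal'
    ]


for change in list_changes("Hello dog", "Hello god, finally"):
    print(change)
```

Each tuple holds the tag, the old text from a and the new text from b, so an insert has an empty old text.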
Let's begin with the "equal" and "insert" opcodes: equal parts are left unmarked, while inserted parts are wrapped in "**". Done manually, this would be:
x = b[0:6] + '**' + b[6:8] + '**' + b[8:9]
Yet we can loop over the opcodes to produce the same result. Notice that the opcodes are listed in order, from the beginning to the end of the strings.
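A minimal sketch of that loop, handling only the 'equal' and 'insert' opcodes and assuming the a and b of the earlier examples ("Hello dog" and "Hello god"):

```python
import difflib

a = "Hello dog"
b = "Hello god"
s = difflib.SequenceMatcher(None, a, b)

x = ''
for tag, i1, i2, j1, j2 in s.get_opcodes():
    if tag == 'equal':
        x += b[j1:j2]                  # unchanged text, left unmarked
    elif tag == 'insert':
        x += '**' + b[j1:j2] + '**'    # inserted text, wrapped in **
    # other tags are ignored in this first step

print(x)
```

Because the opcodes come in order, appending each piece as it arrives is enough to rebuild the marked string from left to right.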