Diffs in Python
Latest revision as of 10:24, 25 May 2015
Typically, diff is used to show the changes between two versions of the same file.
The output of similar file comparison utilities are also called a "diff"
From Wikipedia article on the Diff utility
diffs between strings
I want to detect differences between two strings: both their position and their content.
There seem to be two ways of doing it in Python: a more rudimentary one using the != operator, and another using difflib.
a="Hello dog"
b="Hello god"
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])
6 d g
8 g d
This method has the drawback that it only works for strings of the same length, because it compares the characters position by position.
Difflib is a Python module dedicated to comparing sequences.
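As a minimal sketch of how the position-by-position approach could be stretched to strings of different lengths, the comparison can be padded with itertools.zip_longest. The function name char_diffs is my own, not part of any library, and the approach is still purely positional: a single inserted character shifts everything after it, which is where difflib does better.

```python
from itertools import zip_longest

def char_diffs(a, b):
    """Return (index, char_in_a, char_in_b) for every position where a and b differ.

    Positions that exist in only one string are reported with None
    on the side that has run out of characters.
    """
    diffs = []
    for i, (ca, cb) in enumerate(zip_longest(a, b)):
        if ca != cb:
            diffs.append((i, ca, cb))
    return diffs

print(char_diffs("Hello dog", "Hello god"))   # [(6, 'd', 'g'), (8, 'g', 'd')]
print(char_diffs("Hello dog", "Hello dogs"))  # [(9, None, 's')]
```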
difflib.SequenceMatcher is a class dedicated to comparing pairs of sequences.
import difflib
a="Hello dog"
b="Hello god"
s = difflib.SequenceMatcher(None, a,b)
for block in s.get_matching_blocks():
    print(block)
Match(a=0, b=0, size=6)
Match(a=6, b=8, size=1)
Match(a=9, b=9, size=0)
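To read these triples: each Match says that a[m.a:m.a+m.size] equals b[m.b:m.b+m.size], and the last block with size 0 is only an end marker. A small sketch that turns the blocks back into the matched text (the variable name matched is just for illustration):

```python
import difflib

a = "Hello dog"
b = "Hello god"
s = difflib.SequenceMatcher(None, a, b)

# Slice the matched text out of a, skipping the zero-size end marker
matched = [a[m.a:m.a + m.size] for m in s.get_matching_blocks() if m.size > 0]
print(matched)  # ['Hello ', 'd']
```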
difflib.Differ
compares sequences of lines of text, and produces human-readable differences or deltas. The differences are distinguished by having each comparison line begin with:
'- ' line unique to sequence 1
'+ ' line unique to sequence 2
'  ' line common to both sequences
'? ' line not present in either input sequence
import difflib
a="Hello dog"
b="Hello god"
d = difflib.Differ()
# a and b are strings here, so Differ treats every character as a "line"
diff = d.compare(a, b)
for i in diff:
    print(i)
  H
  e
  l
  l
  o
   
+ g
+ o
  d
- o
- g
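Differ is meant for sequences of lines, so a sketch closer to its intended use passes lists of lines instead of bare strings. The two small lists below are made-up sample data. I do not reproduce the output here, because for similar lines Differ may also add '? ' hint lines whose exact content is easy to get wrong by hand.

```python
import difflib

# Hypothetical sample data: two versions of a two-line text
old_lines = ["Hello dog", "Bye"]
new_lines = ["Hello god", "Bye"]

result = list(difflib.Differ().compare(old_lines, new_lines))
for line in result:
    # '? ' hint lines carry their own trailing newline, so strip it for printing
    print(line.rstrip("\n"))
```

Each output line starts with one of the two-character codes listed above, followed by the line itself.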
get_opcodes()
Return list of 5-tuples describing how to turn a into b
import difflib
a="Hello dog"
b="Hello god"
s = difflib.SequenceMatcher(None, a, b)
for tag, i1, i2, j1, j2 in s.get_opcodes():
    print("{tag} a[{i1}:{i2}] ({a_str}) b[{j1}:{j2}] ({b_str})".format(tag=tag, i1=i1, i2=i2, a_str=a[i1:i2], j1=j1, j2=j2, b_str=b[j1:j2]))
equal a[0:6] (Hello ) b[0:6] (Hello )
insert a[6:6] () b[6:8] (go)
equal a[6:7] (d) b[8:9] (d)
delete a[7:9] (og) b[9:9] ()
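To check that the opcodes really describe how to turn a into b, here is a minimal sketch that rebuilds b by walking them. The helper name apply_opcodes is my own and is not part of difflib.

```python
import difflib

def apply_opcodes(a, b):
    """Rebuild b from a by following the opcodes of SequenceMatcher(None, a, b)."""
    s = difflib.SequenceMatcher(None, a, b)
    pieces = []
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == "equal":
            pieces.append(a[i1:i2])   # unchanged text, copied from a
        elif tag in ("insert", "replace"):
            pieces.append(b[j1:j2])   # new text, taken from b
        # "delete": a[i1:i2] is dropped, so nothing is appended
    return "".join(pieces)

print(apply_opcodes("Hello dog", "Hello god"))  # Hello god
```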
doing things with diffs
reconstruct and mark
I want to use the opcodes to reconstruct the most recent version, marking the changes that were made to arrive at it.
#!/usr/bin/env python
import difflib
# reconstruct b, with marks indicating the changes
a = "Hello dog"
b = "Hello god, finally"
s = difflib.SequenceMatcher(None, a, b)
x = ''
for tag, i1, i2, j1, j2 in s.get_opcodes():
    # compare tags with ==, not "is": "is" tests identity, not equality
    if tag == "insert":
        b_str = b[j1:j2]
        x = x + "**" + b_str + "**"
        # print('insert', 'b[{j1}:{j2}]'.format(j1=j1, j2=j2), b_str)
    elif tag == "replace":
        b_str = b[j1:j2]
        x = x + "_" + b_str + "_"
        # print('replace', 'b[{j1}:{j2}]'.format(j1=j1, j2=j2), b_str)
    elif tag == 'equal':
        b_str = b[j1:j2]
        # print('equal', 'b[{j1}:{j2}]'.format(j1=j1, j2=j2), b_str)
        x = x + b_str
    elif tag == 'delete':
        a_str = a[i1:i2]  # deleted text only exists in a
        if len(a_str) > 0:  # to avoid empty deleted parts
            x = x + "-" + a_str + "-"
        # print('delete', 'a[{i1}:{i2}]'.format(i1=i1, i2=i2), a_str)
print(x)
Hello **go**d_, finally_
This tells us that:
- "go" was inserted
- "og" was replaced by ", finally"
Let's begin with "equals" and "inserts", where the "equal" parts are left unmarked and the inserted parts are wrapped in "**". Done manually, this would be:
x = b[0:6] + '**'+b[6:8]+'**' + b[8:9]
Yet we can loop over the opcodes to produce the same result. Notice that the opcodes are listed in order, from the beginning to the end of the strings.
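A minimal sketch of that loop, restricted to the "equal" and "insert" cases described here: inserted text is wrapped in "**", equal text is copied as it is, replaced text is copied from b without a mark, and deleted text is ignored. The function name mark_inserts is hypothetical.

```python
import difflib

def mark_inserts(a, b):
    """Return b with the parts inserted relative to a wrapped in '**'."""
    s = difflib.SequenceMatcher(None, a, b)
    x = ""
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == "insert":
            x += "**" + b[j1:j2] + "**"
        elif tag in ("equal", "replace"):
            x += b[j1:j2]  # copied unmarked in this simplified version
        # "delete" contributes nothing, since that text is not in b
    return x

print(mark_inserts("Hello dog", "Hello god"))  # Hello **go**d
```

Because the opcodes come in order, simply appending to x is enough to rebuild the string from left to right.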