Diffs in Python

From XPUB & Lens-Based wiki

Typically, diff is used to show the changes between two versions of the same file.

The output of similar file comparison utilities is also called a "diff".

From Wikipedia article on the Diff utility


diffs between strings

I want to detect the differences between two strings: both their position and their content. There seem to be two ways of doing it in Python: a rudimentary one using the != operator, and another using difflib.

a = "Hello dog"
b = "Hello god"
# compare character by character; assumes len(a) == len(b)
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])
6 d g
8 g d

This method only works reliably for strings of the same length: if b is shorter than a, b[i] raises an IndexError, and any extra characters in a longer b are ignored.
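As a minimal sketch of how the same character-by-character idea could cope with strings of different lengths, the comparison can be padded with itertools.zip_longest. The function name char_diffs is made up for this example, and the code assumes Python 3.

```python
from itertools import zip_longest

def char_diffs(a, b):
    """Return (index, char_in_a, char_in_b) for every position where a and b differ.

    Positions past the end of the shorter string are reported with None.
    """
    return [
        (i, ca, cb)
        for i, (ca, cb) in enumerate(zip_longest(a, b))
        if ca != cb
    ]

print(char_diffs("Hello dog", "Hello god"))   # [(6, 'd', 'g'), (8, 'g', 'd')]
print(char_diffs("Hello dog", "Hello dogs"))  # [(9, None, 's')]
```

The comparison is still purely positional: a single character inserted near the start of one string shifts everything after it, so every later position is reported as different.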


difflib is a Python module dedicated to comparing sequences.

difflib.SequenceMatcher is a class dedicated to comparing pairs of sequences.

import difflib
a="Hello dog"
b="Hello god"
s = difflib.SequenceMatcher(None, a,b)

for block in s.get_matching_blocks():
    print(block)
Match(a=0, b=0, size=6)
Match(a=6, b=8, size=1)
Match(a=9, b=9, size=0)
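To read these blocks: each Match(a=i, b=j, size=n) says that a[i:i+n] equals b[j:j+n], and the final block with size 0 is only an end marker. A small sketch, assuming Python 3, that turns the blocks back into the text they cover:

```python
import difflib

a = "Hello dog"
b = "Hello god"
s = difflib.SequenceMatcher(None, a, b)

# Each Match(a=i, b=j, size=n) means a[i:i+n] == b[j:j+n].
# The last block always has size 0 and only marks the end of both strings.
matched = [a[m.a:m.a + m.size] for m in s.get_matching_blocks() if m.size > 0]
print(matched)  # ['Hello ', 'd']
```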


difflib.Differ compares sequences of lines of text, and produces human-readable differences or deltas. The differences are distinguished by having each comparison line begin with:

'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence

import difflib
a = "Hello dog"
b = "Hello god"
d = difflib.Differ()
# Differ expects sequences of lines; given two plain strings,
# it treats every character as a "line"
diff = d.compare(a, b)
for i in diff:
    print(i)
  H
  e
  l
  l
  o
   
+ g
+ o
  d
- o
- g
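Since Differ is meant for sequences of lines, here is a hedged sketch of the same comparison on two short multi-line texts (the second line, "Good morning", is invented for the example). It assumes Python 3, where splitlines accepts the keepends keyword.

```python
import difflib

# Split each text into a list of lines, keeping the trailing newlines
old = "Hello dog\nGood morning\n".splitlines(keepends=True)
new = "Hello god\nGood morning\n".splitlines(keepends=True)

d = difflib.Differ()
result = list(d.compare(old, new))

# Each result line already ends with a newline, so join them as they are
print("".join(result), end="")
```

The changed line shows up once with '- ' and once with '+ ', and the unchanged line with two leading spaces. Differ may also emit '? ' lines that point at the characters that changed inside a line.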


get_opcodes() returns a list of 5-tuples (tag, i1, i2, j1, j2) describing how to turn a into b.

import difflib
a="Hello dog"
b="Hello god"
s = difflib.SequenceMatcher(None, a, b)
for tag, i1, i2, j1, j2 in s.get_opcodes():
    print("{tag} a[{i1}:{i2}] ({a_str}) b[{j1}:{j2}] ({b_str})".format(
        tag=tag, i1=i1, i2=i2, a_str=a[i1:i2], j1=j1, j2=j2, b_str=b[j1:j2]))
equal a[0:6] (Hello ) b[0:6] (Hello )
insert a[6:6] () b[6:8] (go)
equal a[6:7] (d) b[8:9] (d)
delete a[7:9] (og) b[9:9] ()
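One way to check that the opcodes really describe how to turn a into b is to replay them. The helper below, apply_opcodes, is a name made up for this sketch and assumes Python 3: it keeps "equal" text from a, takes "insert" and "replace" text from b, and drops "delete" ranges.

```python
import difflib

def apply_opcodes(a, b):
    """Rebuild b by replaying the opcodes that SequenceMatcher computes for (a, b)."""
    s = difflib.SequenceMatcher(None, a, b)
    out = []
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == "equal":
            out.append(a[i1:i2])   # unchanged text comes from a
        elif tag in ("insert", "replace"):
            out.append(b[j1:j2])   # new text comes from b
        # "delete": a[i1:i2] is dropped, so nothing is appended
    return "".join(out)

print(apply_opcodes("Hello dog", "Hello god"))  # Hello god
```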


doing things with diffs

reconstruct and mark

I want to use the opcodes to mark the changes needed to arrive at the most recent version.

#!/usr/bin/env python
import difflib
# reconstruct b, with marks indicating the changes

a = "Hello dog"
b = "Hello god, finally"
s = difflib.SequenceMatcher(None, a, b)
x = ''

for tag, i1, i2, j1, j2 in s.get_opcodes():
    # compare tags with ==, not "is": "is" tests identity, not string equality
    if tag == "insert":
        x = x + "**" + b[j1:j2] + "**"

    elif tag == "replace":
        x = x + "_" + b[j1:j2] + "_"

    elif tag == "equal":
        x = x + b[j1:j2]

    elif tag == "delete":
        a_str = a[i1:i2]  # deleted text exists only in a
        if len(a_str) > 0:  # to avoid empty deleted parts
            x = x + "-" + a_str + "-"

print(x)
Hello **go**d_, finally_

This tells us that:

  • "go" was inserted
  • "og" was replaced by ", finally"

Let's begin with "equals" and "inserts": "equals" are left unmarked, while "inserts" are wrapped in "**". Done manually, this would be:

x = b[0:6] + '**'+b[6:8]+'**' + b[8:9]

Yet we can loop over the opcodes to produce the same result. Notice that the opcodes are listed in order, from the beginning to the end of the strings.
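A minimal sketch of such a loop, handling only "equal" and "insert" for now. The function name mark_inserts and its mark parameter are invented for this example, and the code assumes Python 3.

```python
import difflib

def mark_inserts(a, b, mark="**"):
    """Return b with every inserted stretch wrapped in `mark`.

    Only "equal" and "insert" opcodes are handled; "replace" and "delete"
    are skipped in this sketch.
    """
    s = difflib.SequenceMatcher(None, a, b)
    parts = []
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == "equal":
            parts.append(b[j1:j2])                # unchanged text, left unmarked
        elif tag == "insert":
            parts.append(mark + b[j1:j2] + mark)  # inserted text, wrapped
    return "".join(parts)

print(mark_inserts("Hello dog", "Hello god"))  # Hello **go**d
```

Because the opcodes arrive in order, appending each piece as it comes is enough to rebuild the marked string from left to right.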