Diffs in Python

From Media Design: Networked & Lens-Based wiki

Typically, diff is used to show the changes between two versions of the same file.

The output of similar file comparison utilities is also called a "diff".

From Wikipedia article on the Diff utility


diffs between strings

I want to detect the differences between two strings, finding both their position and their content. There seem to be two ways of doing this in Python: a more rudimentary one using the != operator, and another using difflib.

a="Hello dog"
b="Hello god"
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])
6 d g
8 g d

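This method has the drawback of only working for strings of the same length: if b is shorter than a, the loop raises an IndexError, and if b is longer, its extra characters are silently ignored.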
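One way to lift the same-length restriction is to pair the characters with itertools.zip_longest, which pads the shorter string with None. The following is only a sketch, written for Python 3; the helper name char_diffs is made up for this example.

```python
from itertools import zip_longest

def char_diffs(a, b):
    """Return (index, char_a, char_b) for every position where a and b differ.

    Positions past the end of the shorter string are reported with None.
    """
    return [
        (i, ca, cb)
        for i, (ca, cb) in enumerate(zip_longest(a, b))
        if ca != cb
    ]

for i, ca, cb in char_diffs("Hello dog", "Hello god!"):
    print(i, ca, cb)
```

The comparison is still purely positional: a single character inserted at the start of one string would make almost every position differ, which is the kind of case difflib handles better.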


difflib is a module in the Python standard library dedicated to comparing sequences.

difflib.SequenceMatcher is a class dedicated to comparing pairs of sequences.

import difflib
a="Hello dog"
b="Hello god"
s = difflib.SequenceMatcher(None, a,b)

for block in s.get_matching_blocks():
    print(block)
Match(a=0, b=0, size=6)
Match(a=6, b=8, size=1)
Match(a=9, b=9, size=0)
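Each Match(a, b, size) says that a[a:a+size] is equal to b[b:b+size]; the last block always has size 0 and only marks the end of both sequences. As a small sketch (Python 3), the matched text can be printed from both strings, together with the ratio() similarity score, which is 2*M/T for M matched characters out of T characters in total.

```python
import difflib

a = "Hello dog"
b = "Hello god"
s = difflib.SequenceMatcher(None, a, b)

for m in s.get_matching_blocks():
    # Both slices hold the same text; the final block is empty (size 0).
    print(repr(a[m.a:m.a + m.size]), repr(b[m.b:m.b + m.size]))

# 7 matched characters out of 18 in total, so 2*7/18
print(s.ratio())
```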


difflib.Differ compares sequences of lines of text, and produces human-readable differences or deltas. The differences are distinguished by having each comparison line begin with:

'- ' 	line unique to sequence 1
'+ '  	line unique to sequence 2
'  ' 	line common to both sequences
'? ' 	line not present in either input sequence
import difflib
a="Hello dog"
b="Hello god"
d = difflib.Differ()
# a and b are plain strings, so every character is compared as if it were a line
diff=d.compare(a,b)
for i in diff:
    print(i)
  H
  e
  l
  l
  o
   
+ g
+ o
  d
- o
- g
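Differ is meant for sequences of lines, so the output above is one character per line only because two plain strings were passed in. A minimal sketch with lists of lines (Python 3; the two sample texts are invented for this example) shows the more usual output, including the '? ' hint lines that point at the changed characters inside a line.

```python
import difflib

# Hypothetical sample input: two versions of a two-line text.
old_lines = ["Hello dog\n", "same line\n"]
new_lines = ["Hello god\n", "same line\n"]

d = difflib.Differ()
result = list(d.compare(old_lines, new_lines))

# Every output line keeps its trailing newline, so join them
# instead of printing them one by one.
print("".join(result), end="")
```

For real files, the lists of lines would typically come from readlines() or from str.splitlines(keepends=True).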


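SequenceMatcher.get_opcodes() returns a list of 5-tuples (tag, i1, i2, j1, j2) describing how to turn a into b. The tag is one of 'replace', 'delete', 'insert' or 'equal'; a[i1:i2] is the affected slice of a and b[j1:j2] the corresponding slice of b.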

import difflib
a="Hello dog"
b="Hello god"
s = difflib.SequenceMatcher(None, a, b)
for tag, i1, i2, j1, j2 in s.get_opcodes():
    print("{tag} a[{i1}:{i2}] ({a_str}) b[{j1}:{j2}] ({b_str})".format(tag=tag, i1=i1, i2=i2, a_str=a[i1:i2], j1=j1, j2=j2, b_str=b[j1:j2]))
equal a[0:6] (Hello ) b[0:6] (Hello )
insert a[6:6] () b[6:8] (go)
equal a[6:7] (d) b[8:9] (d)
delete a[7:9] (og) b[9:9] ()
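To check that the opcodes really describe how to turn a into b, they can be replayed to rebuild b. This is only an illustrative sketch in Python 3; apply_opcodes is a name made up here, not part of difflib.

```python
import difflib

def apply_opcodes(a, b):
    """Rebuild b by walking the opcodes that turn a into b."""
    s = difflib.SequenceMatcher(None, a, b)
    parts = []
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == "equal":
            parts.append(a[i1:i2])   # unchanged text: copy it from a
        elif tag in ("insert", "replace"):
            parts.append(b[j1:j2])   # new text comes from b
        # "delete": a[i1:i2] is dropped, so nothing is appended
    return "".join(parts)

print(apply_opcodes("Hello dog", "Hello god"))
```

The function looks at b only inside "insert" and "replace" opcodes; everything else is taken from a.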


doing things with diffs

reconstruct and mark

I want to use the opcodes to reconstruct the most recent version (b), marking the changes that led to it.

#!/usr/bin/env python
import difflib
# reconstruct b, with marks indicating the changes

a="Hello dog"
b="Hello god, finally"
s = difflib.SequenceMatcher(None, a, b)
x=''

for tag, i1, i2, j1, j2 in s.get_opcodes():
    # compare tags with ==; "is" tests object identity, not string equality
    if tag == "insert":
        b_str=b[j1:j2]
        x = x + "**"+b_str+"**"
#        print('insert','b[{j1}:{j2}]'.format(j1=j1,j2=j2), b_str)

    elif tag == "replace":
        b_str=b[j1:j2]
        x = x + "_"+b_str+"_"
#        print('replace','b[{j1}:{j2}]'.format(j1=j1,j2=j2), b_str)

    elif tag == 'equal':
        b_str=b[j1:j2]
#        print('equal','b[{j1}:{j2}]'.format(j1=j1,j2=j2), b_str)
        x = x + b_str

    elif tag == 'delete':
        a_str=a[i1:i2] # deleted text only exists in a, so slice a with i1:i2
        if len(a_str)>0: # to avoid empty deleted parts
            x = x + "-"+a_str+"-"
#            print('delete','a[{i1}:{i2}]'.format(i1=i1,i2=i2), a_str)


print(x)
Hello **go**d_, finally_

This tells us that:

  • "go" was inserted
  • "og" was replaced by ", finally"

Let's begin with the "equal" and "insert" opcodes only: equal parts are left unmarked, while inserted parts are wrapped in "**". Done manually, this would be:

x = b[0:6] + '**'+b[6:8]+'**' + b[8:9]

However, we can loop over the opcodes to produce the same result. Note that get_opcodes() returns the opcodes in order, from the beginning to the end of the strings.
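Such a loop, restricted to "equal" and "insert", could look like the sketch below (Python 3). The function name mark_inserts and its mark parameter are assumptions made for this example; "replace" and "delete" opcodes are simply skipped here.

```python
import difflib

def mark_inserts(a, b, mark="**"):
    """Rebuild the equal and inserted parts of b, wrapping inserts in mark."""
    s = difflib.SequenceMatcher(None, a, b)
    x = ""
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == "equal":
            x += b[j1:j2]                 # unchanged text, left unmarked
        elif tag == "insert":
            x += mark + b[j1:j2] + mark   # inserted text, wrapped in the mark
    return x

a = "Hello dog"
b = "Hello god, finally"
print(mark_inserts(a, b))
```

For these two strings the result is the same as the manual expression, because the opcodes arrive in string order and the parts can be concatenated as they come.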