April 1st, 2014

Git Source Code Review: Diff Algorithms

I love you Hoi-En.
To find the differences between two "things" is a difficult task that Git has to perform very often:

Here is a summary of all the algorithms used with an associated genesis.

Part 1: Overview
Part 2: Genesis
Part 3: Architecture
Part 4: Algorithms for DIFF


Diff on text

Find the difference between text files is something Git has to do the most commonly :

The first implementation simply leveraged the system diff executable via a call to popen in show-diff.c. diff is a very famous and ubiquitous tool in the Linux world originally developed in the early 1970 for Unix. Its first version used Hunt–McIlroy algorithm. The core algorithm was later notoriously improved thanks to the work of Eugene W. Myers and Webb Miller, work extensively document in the papers: An O(ND) Difference Algorithm and its Variations by Eugene W. Myers and A File Comparison Program by Webb Miller and Myers.

The two most important metrics for diff operations are:

  1. Speed
  2. Quality

Effort on Speed :

System diff not only consumed a fork/exec, it also required a lot of useless access disks because of temporary data. A builtin diff module was needed and implemented in March 2006 patch which introduced the xdiff folder containing a simplified version of libxdiff:

Use a *real* built-in diff generator

This uses a simplified libxdiff setup to generate unified diffs _without_ doing fork/execve of GNU "diff". HUGE performance improvement and no disk access.

Since libxdiff used Eugene W. Myers's algorithm, the diff strategy was called myers. To this day it is still the default way git perform all diffs on text.

Effort on Quality :

If speed is an easy metric to understand and benchmark, quality is subjective and not intuitive. Take for example two versions of a file where a developer simply added a method functhreehalves in the middle. There are two way to reconstruct the change but one is of much higher quality because it clearly shows what was added where the other seems to indicate something was modified.

 

   
   $diff -y file2 file1
   void func1() {            void func1() {
     x += 1                     x += 1
   }                         }

                           > void functhreehalves() {
                           >   x += 1.5
                           > }
                           > 
   void func2() {            void func2() {
     x += 2                    x += 2
   }                         }


   
 

   
   $diff -y file2 file1
   void func1() {            void func1() {
     x += 1                     x += 1
   }                       > }
                           >
                           > void functhreehalves() {
                           >   x += 1.5
                             }
                           
   void func2() {            void func2() {
     x += 2                    x += 2
   }                         }


   

The discussion started in [PATCH 0/3] Teach Git about the patience diff algorithm where quality between git's myers and bazar's patience were compared. Since patience algorithm seemed to provide a better quality on specific input it was shortly added in a patch that modified further xdiff :

Implement the patience diff algorithm

The patience diff algorithm produces slightly more intuitive output than the classic Myers algorithm, as it does not try to minimize the number of +/- lines first, but tries to preserve the lines that are unique.

You can find more documentation on patience algorithm from the author himself Bram Cohen (yes, the guy who also invented Bittorrent): Patience Diff Advantages and Patience Diff, a brief summary.

Further work was done by people from jgit project (which is a Java/Eclipse based). Their histogram based technique was supposed to be faster than myers while still based on based on Cohen's four rules found in patience (best of both world). A patch was eventually submitted:

teach --histogram to diff

Port JGit's HistogramDiff algorithm over to C. Rough numbers (TODO) show that it is faster than its --patience cousin, as well as the default Meyers algorithm.

Note : So which diff strategy should be used? I personally have always used the default (myers). histogram and patience seem to provide better output on specific case but they are so marginal and poorly documented that it is impossible to know when they will be good withiout trial/errors. If non-obsolete test cases could be provided users could probably make educated decisions.

Note : There is a fourth diff strategy called minimal which is actually an extended version of myers: If no minimal diff is requested the myers algo simply returns. But if minimal is requested, the algorithm will iterate until a certain amount of CPU time has been consumed and keep the "best" diff. Best being the smallest diff. Minimalwas always part of xlibdiff and came with the first "speed" patch mentioned earlier.

Trivia : This need for speed also triggered the creation of a performance testing framework for diff strategies.

Trivia : diff is such an important feature that git developer even allow user to use their own external custom engine via the GIT_EXTERNAL_DIFF variable (Using Git diff to detect code movement).

To finish on text diff, and example of trace trace execution that I found useful:


   main                  (git.c)
   cmd_diff              (builtin/diff.c)
   builtin_diff_files    (builtin/pack-objects.c)   
   run_diff_files        (diff-lib.c)
   xdi_diff_outf         (xdiff-interface.c)
   xdi_diff              (xdiff-interface.c)
   xdl_diff              (xdiff/xdiffi.c)
   xdl_do_diff           (xdiff/xdiffi.c)
        xdl_do_patience_diff
        xdl_do_histogram_diff
        meyes
        meyes mininmal
  

Diff on binaries (deltas)

The initial architecture of Git was: one text file => one sha1 => one deflated blob. The only drawback was HD storage consumption but "disk is cheap" was a fair approach and things worked well this way for a while. But big repositories suffered when being cloned across a network and all those blobs required a lot of disk access.

The topic RFC: adding xdelta compression to git raised those issues and in the end it was decided that storage was cheap but bandwidth and I/O were not. The xdelta lib was briefly considered but a customized version of xlibdiff was favored and added to xdiff.

[PATCH] Deltification library work by Nicolas Pitre.

This patch adds the basic library functions to create and replay delta information. Also included is a test-delta utility to validate the code.

Since binary files miss the CR character used to build atomic units in text algorithm (lines), different heuristic can be used in order to cut the stream. Set length of 48 bytes is an approach but git used adler32 before being replaced later with Rabin's Fingerprinting by random polynomials in an other patch:

replace adler32 with Rabin's polynomial in diff-delta

This brings another small repacking speedup for sensibly the same pack size. On the Linux kernel repo, git-repack -a -f is 3.7% faster for a 0.4% larger pack.

Rabin's finger print technique was a big improvement since the sliding window allowed to detect short addition at the beginning of a zip and keep deltas tiny.

To finish with binary diff, here is the execution path of pack creation :


   main                  (git.c)
   cmd_pack_objects      (builtin/pack-objects.c)
   prepare_pack          (builtin/pack-objects.c)
   ll_find_deltas        (builtin/pack-objects.c)
   threaded_find_deltas  (builtin/pack-objects.c)
   find_deltas           (builtin/pack-objects.c)
   try_delta             (builtin/pack-objects.c)
   create_delta          (diff-delta.c) 
  


Diff on command-line

git sometimes is not interested by the exact difference but rather by the distance between two strings. In 2008 was introduced a "suggestion" feature :

git wrapper: DWIM mistyped commands
This patch introduces a modified Damerau-Levenshtein algorithm into Git's code base, and uses it with the following penalties to show some similar commands when an unknown command was encountered

That is how the dispatcher seen in the architecture part is able to provide suggestions when a typo is suspected :



    $git adt test.txt
    git: 'adt' is not a git command. See 'git --help'.

    Did you mean this?
      add

    

Word suggestion are generated by calculating an edit distance for each known builtin command using levenshtein distance . Found in levenshtein.c , it is a nice example of dynamic programming.

diffcore, xdiff, xdelta ... what ?

Just to summarize (because I was confused when I started) :

Next

To be published: Git internal algorithms for graphs and merge operation.

Comments

 

Fabien Sanglard @2014