[cdv-devel] The new Codeville merge algorithm

Bram Cohen bram at bitconjurer.org
Thu May 5 13:22:22 EDT 2005


(Also posted to the revctrl mailing list, since this is of interest to
people working on other revision control projects, see
http://lists.zooko.com/mailman/listinfo/revctrl )

I've been discussing the Codeville merge algorithm with a few people on
#revctrl recently, most notably Nathaniel Smith from Monotone and Ross,
who does most of the real work on Codeville.

We have a new version of Codeville merge mostly planned out now. I'm going
to explain how Codeville merge currently works, what the problems are, the
reasoning behind the new merge algorithm, and how the new merge algorithm
works.

Codeville merges by doing a two-way merge between files, then using
history information to decide which side of each differing section of code
wins. Basically it annotates both sides and checks whether the revisions
those lines came from have already been applied to the other side; if
they have, the side which already has those revisions wins. If neither
side wins, it's a conflict and gets escalated to the user, using
CVS-style conflict annotations in the file.
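
To make that concrete, here's a minimal sketch of the resolution rule
(my own illustration, not Codeville's actual code), assuming each side's
differing lines carry the set of revisions they came from:

    def pick_winner(left_revs, right_revs, applied_left, applied_right):
        # A side wins if the *other* side's lines come entirely from
        # revisions this side has already applied.
        left_wins = right_revs <= applied_left
        right_wins = left_revs <= applied_right
        if left_wins and not right_wins:
            return 'left'
        if right_wins and not left_wins:
            return 'right'
        # Neither side wins (or, pathologically, both do): conflict.
        return 'conflict'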

Usually this works quite well, but it can screw up pretty badly. Since
it's just doing a two-way merge, lines can erroneously line up with other
lines of a completely different lineage. It also can lead to ambiguous
clean merges - situations in which *both* sides should win. It turns out
these do happen, and they've been causing some problems. Codeville
currently has a fix for ambiguous clean merges which is 'good enough for
engineering', but we'd like something better. The other case related to
ambiguous clean merges is coincidental matches, where two similar-looking
lines with different lineages are matched up to each other. That one also
causes problems.

Our goals for the new merge algorithm are -

(1) get rid of ambiguous clean merges

(2) get rid of coincidental matches

(3) add the property that a series of clean merges will always have the
same result regardless of what order the files were merged in

(4) make merge generally more mathematical, so that things can be proven
about it

There are upcoming features which we hope will be easier to build on the
new algorithm, but the immediate goal here is to clean up what we've got,
because that leads to some well-defined criteria, and we have a general
feeling that a well-behaved merge algorithm will make everything easier.

One thing which is absolutely NOT a goal is to have a pathway to
supporting reordering lines within a file. We've been going over lots of
wacky edge cases, and have generally come to the conclusion that
reordering lines within a file breaks things often enough to more than
obliterate the benefits it gets by avoiding unnecessary conflicts, even
from a strictly behavioral standpoint, ignoring implementation
difficulties. A VCS which knows more about what it's tracking than just
that it's a series of lines might be able to do moves in a reasonable way,
but that's a research topic we aren't even gonna touch for the foreseeable
future.

Now, for the new merge algorithm.

We will give each line an identifier, and only allow it to be matched up
with itself, regardless of how much the file has shifted around and how
similar it may look to other lines with a different lineage. This approach
is inspired by the very nice line alignment which Darcs has; in fact, we've
been referring to the new merge algorithm as the 'precise-alignment
merge'. It may be that as long as you only ever hit clean merges the new
Codeville merge is isomorphic to Darcs merge.
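
As a toy example of identity-based matching (the identifier scheme here
is invented for illustration, not Codeville's actual one):

    # Each line is (identifier, text); matching is by identifier only,
    # so identical text with a different lineage never lines up.
    left = [('rev1:0', 'x = 0'), ('rev5:0', 'x = 0')]   # same text twice
    right = [('rev1:0', 'x = 0')]

    right_ids = {line_id for line_id, _ in right}
    matches = [line_id for line_id, _ in left if line_id in right_ids]
    print(matches)   # ['rev1:0'] - the rev5 copy is a brand new line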

A full list of lines which have been deleted will be kept in each
revision, and when two revisions are merged together lines which have been
deleted from either side must be deleted in the result. Perhaps in the
future there will be a concept of un-deleting a line, but determining
those implicitly is very prone to error, so for now once a line is dead,
it stays dead.
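
In other words (a sketch, again with invented identifiers):

    # Each revision carries the full set of dead line identifiers,
    # and merging two revisions just unions them.
    dead_left = {'rev2:3', 'rev4:0'}
    dead_right = {'rev2:3', 'rev7:1'}

    dead_merged = dead_left | dead_right   # once dead, stays dead
    print(dead_merged)   # {'rev2:3', 'rev4:0', 'rev7:1'}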

Two lines will never be allowed to appear in different orders in different
revisions (that would nearly guarantee an ambiguous clean merge). This
leads to an interesting case. If there's an ancestral version which is ab
(we tend to use strings where each character represents a line when giving
examples) and one person makes aXb and another makes aYb, is aXYb
possible? Obviously we can't have both aXYb and aYXb allowed, but one or
the other doesn't cause problems. In the interests of overall simplicity,
we've decided to make a full ordering of all lines. The full ordering, as
I'll explain in a second, makes merging a lot easier, and that's the
motivation for doing it. Allowing aXYb is a bonus.

Before going on about the wonderful new merge technique which a full
ordering enables, I'll give an example of a full ordering algorithm which
works, since it isn't all that obvious that there is one.

The criteria for a full ordering are (1) if two lines appear in any file
at the same time, they must appear in that order in the full ordering, and
(2) the same full ordering must result regardless of the order in which
revisions are merged together.

For each line, give it an ordering identifier, which is a tuple used to
'tiebreak' between lines. To compare two ordering identifiers, first the
first elements are compared, and if those are the same the second ones are
compared, and if those are also the same the third ones are compared. The
tuple elements are, first, the longest number of steps between the new
revision and the history root, second, the new revision identifier, and
third, the line number within that revision. When new text is inserted,
it's put between the lines before and after it in the full ordering.

When merging two versions together, we combine their full orderings,
first by separating out lines based on what flanks them, so that if line X
appears on both sides then a section which appears before X doesn't have
to be compared to a section after X. For sections which are flanked the
same way on both sides we first determine their tiebreak ordering, and
then mix. Mixing is done by successively taking the smallest identifier
which appears anywhere in either section, pulling everything up to and
including that line to the beginning, and iterating on the rest. For
example, if we're mixing 513 and 42, first the 51 gets pulled out, then
the 42, then the 3, so the full ordering is 51423.
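
Here's a small sketch of the mixing step (real ordering identifiers are
tuples, which Python compares element by element anyway, but plain
numbers show the idea):

    def mix(a, b):
        # Mix two tiebreak-ordered sections into one full ordering by
        # repeatedly finding the smallest identifier left in either
        # section and pulling everything up to and including it out.
        result = []
        while a or b:
            src = a if not b or (a and min(a) <= min(b)) else b
            cut = src.index(min(src)) + 1
            result += src[:cut]
            del src[:cut]
        return result

    print(mix([5, 1, 3], [4, 2]))   # [5, 1, 4, 2, 3], i.e. 51423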

We aren't especially fond of this full ordering technique, but it does
work. We'd really like one which 'clumps' better - that is, if you have
two different branches, and merge them together, all lines from one side
should come after all lines from the other in the full ordering. Finding a
better ordering technique is an interesting problem, but we have one which
is perfectly serviceable for now.

Once we have a full ordering, merge becomes surprisingly easy. Simply
combine line states, and pull the living lines out of the list. Line
states are: not born, living (inserted), and dead (deleted). If a line is
dead on either side, it's dead in the result; if it's living on either
side but dead on neither, it's alive in the result; and if it's not born
on both sides, it's not born in the result. To determine if there's a
conflict, we look over each section between lines which are alive on both
sides, and see if there are indicators that both sides must win in that
section. Winners are determined by the following table. The letters
correspond to unborn, living, and dead, '<' means the left side must win,
and '>' means the right side must win.

uu
ul  >
ud
lu  <
ll
ld  >
du
dl  <
dd

Note that if someone adds a diagnostic line and then deletes it, that
doesn't cause a merge conflict later, which is kind of a neat feature.
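
Here's a sketch of that merge in code (mine, for illustration), assuming
each line in the full ordering carries one state per side:

    # States: 'u' = unborn, 'l' = living, 'd' = dead.
    WINNER = {('u', 'l'): '>', ('l', 'u'): '<',
              ('l', 'd'): '>', ('d', 'l'): '<'}

    def combine(left, right):
        # Dead beats living, living beats unborn.
        if 'd' in (left, right):
            return 'd'
        return 'l' if 'l' in (left, right) else 'u'

    def merge(lines):
        # lines: (content, left_state, right_state) in full ordering.
        # Returns merged contents plus whether any section between
        # mutually living lines needs both sides to win.
        merged, conflict, section = [], False, set()
        for content, left, right in lines:
            if (left, right) == ('l', 'l'):
                # Alive on both sides: closes the current section.
                conflict |= {'<', '>'} <= section
                section = set()
            elif (left, right) in WINNER:
                section.add(WINNER[left, right])
            if combine(left, right) == 'l':
                merged.append(content)
        conflict |= {'<', '>'} <= section
        return merged, conflict

    # Ancestor ab, one side made aXb, the other aYb:
    lines = [('a', 'l', 'l'), ('X', 'l', 'u'),
             ('Y', 'u', 'l'), ('b', 'l', 'l')]
    print(merge(lines))   # (['a', 'X', 'Y', 'b'], True) - a conflict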

Since we have a full ordering, we might as well store all lines in that
ordering. And we've now reinvented the weave format (I literally went
through this process, not knowing about weaves until the whole merge
algorithm was done). Weaves make it really quick to retrieve any revision
of history - just go through a graph of history which gives newly deleted
and inserted lines only, make the list of what's still left, then run
through the list of all lines ever and pull out the ones in this revision.
Quite simple, speedy, and space-efficient.
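
As a toy illustration of pulling a revision out of a weave (identifiers
invented for the example):

    # The weave: every line ever, in full ordering.
    weave = [('rev1:0', 'def f():'),
             ('rev1:1', '    return 0'),
             ('rev3:0', '    return 1')]

    # A revision is just the set of line identifiers alive in it;
    # here rev3 replaced the return line.
    alive = {'rev1:0', 'rev3:0'}

    snapshot = [text for line_id, text in weave if line_id in alive]
    print(snapshot)   # ['def f():', '    return 1']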

The other thing we need an algorithm for is resolution. Given the old
versions of a file and the new version, determine what lines map where. We
aren't sure of the best way to do this yet, but probably the best approach
is to find lines which only occur exactly once (comparing by line content,
not identifier) in both the child and the ancestors, then do a longest
common substring on those, then extend the selected lines out as
continuing matches in either direction. I just recently rewrote the
current two-way match code to be based on this approach, and it resulted
in a big improvement in both speed and behavior.
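
A rough sketch of the anchor-finding step (using Python's
difflib.SequenceMatcher as a stand-in for the common-substring pass; the
helper name is mine):

    from collections import Counter
    from difflib import SequenceMatcher

    def find_anchors(old, new):
        # Keep only lines whose content occurs exactly once in each
        # version, then take the common run structure of those.
        old_counts, new_counts = Counter(old), Counter(new)
        unique = {line for line in old_counts
                  if old_counts[line] == 1 and new_counts[line] == 1}
        a = [line for line in old if line in unique]
        b = [line for line in new if line in unique]
        matcher = SequenceMatcher(None, a, b)
        # These anchors would then be extended outward line by line.
        return [a[block.a + i]
                for block in matcher.get_matching_blocks()
                for i in range(block.size)]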

-Bram


