Shaun's Mildly Coherent Ramblings

home

Git is a Directed Acyclic Graph

17 Feb 2014

I’ve spent some time in the past explaining Git to new users who have experience using Subversion and CVS, and after an amount of trial and error I have found what I believe is the easiest way for a developer to quickly understand how Git works and why it is useful.

There are many sources online which do an excellent job explaining the basics of Git, but very few that talk about the underlying model and why it’s important to understand. I believe this is so important that you can’t move beyond a linear understanding of Git without understanding the model on some level.

I am surprised that this idea isn’t more widespread. That your underlying model is a DAG isn’t something you’d normally expect to see in documentation, but consider Git’s target audience — developers. Since we all likely know what a DAG is already, why not use that to our advantage in understanding how Git works?

If you’re unfamiliar with the term, take a moment to read its name: Directed Acyclic Graph. It’s a very high quality name that explains the concept in three words. However, there’s a deeper explanation on Wikipedia if you’d like to read more.

What is a Version Control System?

Think for a moment about what defines a VCS. Among the commonly listed features are:

  • Release Versioning
  • Team Collaboration
  • Merging and Conflict Resolution

These are all great problems to have solved, but don’t define what VCS actually is. In the simplest form, a VCS is a model of the history of a project. Workflow and release versioning are great side effects of this, but whatever VCS you use should accurately represent what has happened with your project.

This is where the non-linear model employed by Git shows its strength. In any team with multiple concurrent streams of work, the purely linear model offered by Subversion is not an accurate record of what happened. Consider how common behaviours manipulate the history:

  • When you author an ordinary commit, you’re creating a new set of changes with a parent
  • When you merge a number of branches, you’re creating an empty commit with multiple parents
  • When you resolve a merge conflict, your merge commit has changes describing the resolution

In a VCS with immutable commits (and they all should be!), it’s intuitive that no commit can be its own ancestor, and so a DAG is a perfect fit for understanding the VCS model. Git takes this even further, and uses a DAG as its underlying data structure.

The Commit

Diagram showing 3 Git commits. a ← b ← c

In this diagram (and all diagrams I use in this post), I have reversed the direction of the arrows from what you would typically expect in a VCS article. The reason for this is made clear by taking a look inside a real Git repository, set up as follows:

$ git commit --allow-empty -m 'a'
[master (root-commit) f40c82c] a
$ git commit --allow-empty -m 'b'
[master 82f7ee0] b
$ git commit --allow-empty -m 'c'
[master 25ce70f] c

There’s one piece of information that git log normally excludes, which I’ll show using a custom output format: parent hashes. To get this output I’ve defined an alias in my ~/.gitconfig:

[alias]
log-parents = log --graph --pretty=format:'commit:  %h%d - %s%nparents: %p%n'

Now, we can see the parent of each commit listed explicitly from the command line:

$ git log-parents
* commit:  25ce70f (HEAD, master) - c
| parents: 82f7ee0
|
* commit:  82f7ee0 - b
| parents: f40c82c
|
* commit:  f40c82c - a
  parents:

It should already be clear why I showed the arrow from each commit pointing at its parent. This is how Git works. Each commit points to its parent(s), and since no commit can be its own ancestor, we intuitively have a DAG.

The Merge Commit

Diagram showing a merge commit 'f' with two parents

This is where things start to get more interesting. Since a merge commit joins multiple branches together into one, it must have multiple parents. We can demonstrate this on the command line:

$ git checkout -b other 82f7ee0
Switched to a new branch 'other'
$ git commit --allow-empty -m 'd'
[other 1e821cf] d
$ git commit --allow-empty -m 'e'
[other 92b0992] e
$ git checkout master
Switched to branch 'master'
$ git merge other -m 'f'
Merge made by the 'recursive' strategy.

By inspecting on the command line, we can see the resulting structure is the same as the diagram above, and f now points at its two parents after the merge:

$ git log-parents
*   commit:  4b084ec (HEAD, master) - f
|\  parents: 25ce70f 92b0992
| |
| * commit:  92b0992 (other) - e
| | parents: 1e821cf
| |
| * commit:  1e821cf - d
| | parents: 82f7ee0
| |
* | commit:  25ce70f - c
|/  parents: 82f7ee0
|
* commit:  82f7ee0 - b
| parents: f40c82c
|
* commit:  f40c82c - a
  parents:

The same applies when merging more than two commits.

Cherry-Picking a Commit

Diagram showing commit 'e' cherry-picked onto master

Starting with the same initial state as before the merge commit above, we can use the cherry-pick command to copy the e commit into master:

$ git cherry-pick other --allow-empty
[master 6fdfd4e] e

Following this, the structure shows that the commit has been copied, and the original commit is still intact.

$ git log-parents master other
* commit:  6fdfd4e (HEAD, master) - e
| parents: 25ce70f
|
* commit:  25ce70f - c
| parents: 82f7ee0
|
| * commit:  92b0992 (other) - e
| | parents: 1e821cf
| |
| * commit:  1e821cf - d
|/  parents: 82f7ee0
|
* commit:  82f7ee0 - b
| parents: f40c82c
|
* commit:  f40c82c - a
  parents:

Not a command you’d use very frequently, but still a very useful tool to have available.

Amending a Commit

Diagram showing commit 'c' amended to form commit 'g'

When a commit is amended, such as c in this diagram, the original commit is not destroyed or modified. Instead it is left dangling — it points at its parent, but nothing points at it. The new commit g is created with the same parent as the original commit, but is an entirely new commit.

After resetting my master back to c, amending a commit looks like:

$ git reset --hard 25ce70f
HEAD is now at 25ce70f c
$ git commit --allow-empty --amend -m 'g'
[master 7649ced] g

Notice that it’s a new commit with a new commit hash. It has the same parent and the same content, but a different message and a different hash. The graph (including the old commit c) now looks like:

$ git log-parents master 25ce70f
* commit:  7649ced (HEAD, master) - g
| parents: 82f7ee0
|
| * commit:  25ce70f - c
|/  parents: 82f7ee0
|
* commit:  82f7ee0 - b
| parents: f40c82c
|
* commit:  f40c82c - a
  parents:

The old c commit was not modified, and if needed we could retrieve the commit as it existed a moment ago.

Rebasing a Branch

What’s remarkable about this DAG model is that a rebase is simple to understand when you can visualise it as a graph operation – copying nodes from one location to another.

A rebase can be performed in two separate but very similar ways. The simplest is conceptually similar to a merge, but rather than creating a merge commit to join branches, it copies the tail of one branch onto the tip of another:

Diagram showing branch 'master' rebased onto branch 'other'

The initial state is as you’d expect from the diagram above. Notice that I’ve created a branch at the merge base of the two branches, which is also shown in yellow in the diagram:

$ git branch merge-base $(git merge-base master other)
$ git log-parents master other
* commit:  f82ecf8 (HEAD, other) - c
| parents: 5a34a25
|
| * commit:  959b4ef (master) - e
| | parents: e543f88
| |
| * commit:  e543f88 - d
|/  parents: 5a34a25
|
* commit:  5a34a25 (merge-base) - b
| parents: 4eb393f
|
* commit:  4eb393f - a
  parents:

When rebasing in this way, I find it useful to think of the parameter order as a concatenation, just to remember which order they take:

$ git rebase other master
First, rewinding head to replay your work on top of it...
Applying: d
Applying: e
$ git log-parents
* commit:  d75f3a0 (HEAD, master) - e
| parents: d3e7e6a
|
* commit:  d3e7e6a - d
| parents: f82ecf8
|
* commit:  f82ecf8 (other) - c
| parents: 5a34a25
|
* commit:  5a34a25 (merge-base) - b
| parents: 4eb393f
|
* commit:  4eb393f - a
  parents:

So now master has been rebased onto other, and the history between the two branches has been made linear.

The other form of rebase takes a range of commits, and copies them to a different location:

Diagram showing branch 'master' rebased explicitly to remove a commit

Though this diagram shows it as the inverse of what we just did, in actual fact it’s the same operation. Previously, the other branch served two purposes:

  1. Selecting the merge base
  2. The branch to rebase onto

We can override the branch we’re rebasing onto (or any commit ref – branch not required), which means we can take any range of commits, and copy them anywhere in the graph. We can use this to revert what we did above:

$ git rebase --onto merge-base other master
First, rewinding head to replay your work on top of it...
Applying: d
Applying: e
$ git log-parents master other
* commit:  83ea7a1 (HEAD, master) - e
| parents: e1db73c
|
* commit:  e1db73c - d
| parents: 5a34a25
|
| * commit:  f82ecf8 (other) - c
|/  parents: 5a34a25
|
* commit:  5a34a25 (merge-base) - b
| parents: 4eb393f
|
* commit:  4eb393f - a
  parents:

Closing Thoughts

Git’s model is actually quite simple when you understand what’s going on. Some of the biggest sticking points I’ve seen in understanding Git have been:

  • Trying to understand Git by drawing comparisons to Subversion isn’t useful, because Subversion lacks many of the concepts that are found in Git. The only thing the two have in common is that they both have a command called commit (which has different behaviour in each)
  • Different Git commands can have a totally inconsistent syntax, which adds to the learning curve. For example, compare git-remote to git-branch
  • The commands which manipulate the index and the working copy can take time to learn
  • There are a lot of commands. Most of them you’ll hardly ever use, but the list is overwhelming

Truth be told, it is a more complex system than Subversion, but that’s out of sheer necessity because it’s so much more powerful. Like anything, the best approach is to learn enough to be mostly productive, and take the plunge.