joolean is currently certified at Journeyer level.

Name: Julian Graham
Member since: 2004-12-07 17:48:29
Last Login: 2017-04-23 01:19:39

FOAF RDF Share This



Recent blog entries by joolean

Syndication: RSS 2.0

gzochi 0.11 is out. Enjoy it in good health.

The major innovation over the previous release is that the client side of the distributed storage engine now releases the locks it requests from the meta server. This wasn't easy to orchestrate, so I want to say a little bit about how it works.

Some context: The distributed storage engine is based on a paper by Tim Blackman and Jim Waldo, and it works a bit like a cache: The client requests an intentional lock (read or write) on a particular key from the server, and if the server grants the client's request, it serves up the value for the key along with a temporary lease - essentially a timeout. For the duration of the lease, the client is guaranteed that its lock intentions will be honored. If it's holding a read lock, no other clients can have a write lock; if it's got a write lock, no other clients can obtain a read or write lock. Within the client, the key is added to a transactional B+tree (a special instance of the in-memory storage engine) and game application threads can execute transactions that access or modify the data in the B+tree just as they would in a single-node configuration. When a transaction completes, new and modified values are transmitted back up to the meta server, but they also remain in the local B+tree for access by subsequent transactions. When the lease for a key expires - and the last active transaction using the affected key either commits or rolls back - its lock is released, and the client must petition the meta server to re-obtain the lock before it can do anything else with that key.

The tricky parts happen at the edges of this lifecycle; that is, when a lock is obtained and when it is released. In both cases, the client's view of available data from the perspective of active transactions must change. When the client obtains a lock, it gains access to a new key-value pair, and when it releases a lock, it loses that access. These changes occur asynchronously with respect to transaction execution: The arrival of a message from the meta server notifies the client that a lock has been granted (or denied) and lock release is governed by a local timeout. It's tempting to try implement these processes as additional transactions against the B+tree, such that when a new key is added or an expired key is removed, the modification to the B+tree occurs in a transaction executed alongside whatever other transactions are being processed at the time. Unfortunately, this can lead to contention and even deadlock, since adding or removing keys can force modifications to the structure of the B+tree itself, in the form of node splitting or merging. What to do, then, given that it's not acceptable that these "system" transactions fail, since they're responsible for maintaining the integrity of the local cache? You could adjust the deadlock resolution algorithm to always rule in favor of system transactions when it comes to choosing a transaction to mark for rollback, but since lock acquisition and release are relatively frequent, this just transfers the pain to "userland" transactions, which would in turn see an undue amount of contention and rollback.

The answer, I think, involves going "around" the transactional store. For newly arriving keys, this is straightforward: When a new key-value pair is transmitted from the meta server as part of a lock acquisition response, don't add it to the B+tree store transactionally; instead, add it to a non-transactional data structure like a hash table that'll serve as a cache. Transactions can consult the cache if they find that the key doesn't exist in the B+tree. The first transaction to modify the value can write it to the B+tree, and subsequent transactions will see this modified value since they check the B+tree before the cache.

Orchestrating the removal of expired keys is more challenging. Because the B+tree takes precedence over the cache of incoming values, it's crtical that the B+tree reflect key expiration accurately, or else new modifications from elsewhere in the cluster may be ignored, as in the following pathological case:
  1. Transaction on node 1 obtains write lock on key A, value is stored in cache
  2. Transaction on node 1 modifies key A, writing new value to the B+tree
  3. Transaction on node 1 releases lock on key A
  4. Transaction on node 2 obtains write lock on key A, modifies it, and commits
  5. Transaction on node 2 releases lock on key A
  6. Transaction on node 1 obtains read lock on key A, new value stored in cache
  7. Transaction on node 1 attempts to read key A, but sees old value from B+tree

After some consideration, I borrowed a page from BigTable, and attacked the problem using the concept behind tombstone records. Every record - even deleted records - written to either the B+tree or the incoming value cache are prefixed with a timestamp. To find the effective value for a key, both the B+tree and the cache must be consulted; the version with the most recent timestamp "wins." In most BigTable implementations (e.g., HBase and Cassandra) there's a scheduled, asynchronous "compaction" process that sweeps stale keys. I didn't want to run more threads, so I prevent stale keys from piling up by keeping a running list of released keys. Once that list's length exceeds a configurable threshold, the next transaction commit or rollback triggers a stop-the-world event in which no new transactions can be initiated, and a single thread sweeps any released keys that haven't since been refreshed. With a judiciously configured threshold, the resulting store performs well, since the sweep is quite fast when only a single transaction is active.

This was a real head-scratcher, and I feel proud of having figured it out despite having skipped all databases courses during my undergraduate coursework. Download my project and try it out for yourself!

gzochi 0.10 is out, and it's a big release, befitting its double-digit minor version, I hope. In particular, it includes an initial version of the meta server, which, as its name suggests, provides services to a cluster of gzochi application servers. In this first iteration, the meta server manages R/W locks on portions of a game's object graph, and feeds temporary copies of the relevant object data to the application server nodes to fuel task execution. In the future, it'll also be responsible for client load balancing and coordinating channel message delivery across the cluster.

I've been working on this release since the beginning of the year, although much of that time was spent brooding over the way these systems work in Project Darkstar and in factoring out some core components of the gzochid container to make them available to the meta server (including a kind of minimal dependency injection framework built on top of GObject - surprised no one's done that before). In the design I ultimately settled on, the native, memory-backed storage engine plays a key role, acting as a cache for object data received from the meta server, which maintains the canonical stores of game data. And so I also spent a lot of cycles examining the performance and concurrency profile of that system, tweaking lock orderings and tree traversal algorithms.

One of the more surprising things I discovered was that locks in the memory storage engine as of 0.9 were too granular, and that this made the store unusually prone to deadlock. As I mentioned in an earlier post, the architecture of the memory storage engine was based on my reading of the Berkeley DB source code - especially the B+tree implementation. There's a lot of stuff in BDB, though, and it's got plenty of features that I was pretty sure I could ignore while building my own transactional data store. One thing I decided to leave out of my implementation was pages. After all, if I didn't need to transfer this data to or from durable storage, then why bother partitioning it into disk bandwidth-friendly chunks? It turns out, though, that pages serve another valuable purpose: They limit the granularity of locks and force concurrent transactions to serialize their writes when accessing keys within some threshold distance, especially with respect to insertions of new key-value pairs. Take the following example:
  1. Transaction 1 attempts to update key a1 (write lock a1)
  2. Transaction 2 attempts to read key a2 (read lock a2)
  3. Transaction 2 attempts to insert key a3 (write lock a3)
  4. Transaction 1 attempts to update key a2 (write lock a2)

In principle, there's nothing about this lock ordering that should lead to deadlock. The two transactions are accessing key a2 in incompatible ways, but that just means that the second transaction might need to wait for the first to complete before it can get a write lock on a2. However, in a B+tree, the interstitial structural nodes - not just the "leaf" key-values - are subject to read/write locking when structural modifications such as inserts are made to the tree. So the sequence of locks above actually means:
  1. Transaction 1 attempts to update key a1 (write lock a1, read lock parent a)
  2. Transaction 2 attempts to read key a2 (read lock a2, read lock parent a)
  3. Transaction 2 attempts to insert key a3 (write lock a3, write lock parent a)
  4. Transaction 1 attempts to update key a2 (write lock a2, read lock parent a)

With this additional bit of context, it becomes clear how the contention between these transactions can lead to deadlock: The first transaction can't make progress until it can add get a write lock on key a2, which the second transaction won't release until it can get a write lock on interstitial parent node a to add the key a3 to it. This pattern of clustered reads and insertions is quite common for gzochi application tasks in which some data from the game object graph is read or updated and then another task is durably scheduled for execution. But this is also where pages can help! By grouping regions of an ordered keyspace into "chunks" of keys that have to be locked in bulk, the granularity of interleaved access between concurrent transactions is effectively capped, and transactions attempting to update or insert proximal keys are forced to serialize. Bad for concurrency, but ultimately good for performance in a latency-sensitive environment where retrying a transaction from scratch hurts. Page locking makes the following change to the lock sequence above:
  1. Transaction 1 attempts to update key a1 (write lock page a)
  2. Transaction 2 attempts to read key a2 (read lock page a)
  3. Transaction 2 attempts to insert key a3 (write lock page a)
  4. Transaction 1 attempts to update key a2 (write lock page a)

In this scenario, because the first transaction's update attempt requires a write lock on the entire page, the second transaction can't acquire that toxic read lock. The practical result is that the two transactions execute in serial, like so:
  1. Transaction 1 attempts to update key a1 (write lock page a)
  2. Transaction 1 attempts to update key a2 (write lock page a)

[Transaction 1 commit: unlock page a]
  1. Transaction 2 attempts to read key a2 (read lock page a)
  2. Transaction 2 attempts to insert key a3 (write lock page a)

After making the switch to pages in the memory storage engine, the rate of deadlock for these kinds of transaction workflows dropped sharply, and is now roughly the same as that of the BDB-based storage engine. ...Which stands to reason.

Check out the release! I'm proud of it.
23 May 2015 (updated 23 May 2015 at 22:30 UTC) »
gccgo and Autotools

As part of a personal project, I wrote a small Go program recently to marshal some data from a MySQL database as JSON over HTTP. I hadn't written a lot of (read: any) Go before this, and I found the process relatively painless and the implementation much more concise than the alternatives in Java, PHP, or Python. However, when I went to integrate my program with the rest of the Autotools build for my project, I ran into some obstacles, mostly related to the incomplete / broken support for Go in the current stable version of Autoconf (2.69). There are apparently fixes for the most obvious bugs coming in Autoconf 2.70, but that release has been in a development for years; and even once released, to the best of my knowledge, it won't include important features like tests for available Go libraries. So I spent a few days working on a better Autotools Go integration, which I'm attaching below.

A few notes to make what follows a bit clearer:

First, Go already has a build system, more or less - it's called go. If you've got a library or a program called "mypackage/myproject/myprog," and you've put the source files in the right directory, you can run...

go build mypackage/myproject/myprog

...and wind up with a working, statically-linked executable. What is the right directory? It's the "src" directory under one of the directories in $GOPATH. The Go project has a good amount of documentation on the topic of code organization, but in brief, the layout of a directory that forms a component of your GOPATH should be:

  • pkg/[COMPILER_]$GOOS_$GOARCH/: Compiled library files go here
  • bin/: Compiled executable files go here
  • src/: Source code goes here, grouped by package

The go command is a front-end for the native go compilers (6g, 6l, 8g, 8l, etc.) as well as for gccgo (via the -compiler flag). It figures out where all the external dependencies are in your $GOPATH and passes flags and libraries to the compilers and linkers. If you run gccgo directly - that is, without using the go front-end - you have to assemble these arguments and paths yourself.

`go build' is the mainstream, nicely literate way of executing a build for .go files, and it's how most people familiar with the language will go about it. However, Autotools' existing support for Go unsurprisingly depends on direct interaction with gccgo, and I wouldn't expect that to change in any near-term releases. `go build' is convenient for fast, iterative builds; I find Autotools-based builds useful for packaging a source distribution for delivery to environments that need to be interrogated to locate dependencies. I wanted my project's build to work both for people doing `./configure && make' as well as for people running `go build'.

The files below provide:

  • A behind-the-scenes patch for the broken `AC_PROG_GO' in Autoconf 2.69
  • A new macro implementation - AC_CHECK_GOLIB - that finds and tests dependencies for Go programs and libraries, and which behaves similarly to pkg-config's `PKG_CHECK_MODULES'.
  • A working example of an Autotools build for a small project that uses Autotools for its build and depends on some common Go web service and database libraries.

Provides the patch and macro implementation. Bundle this file with your project to apply the changes locally, or put it in `/usr/local/share/aclocal' to make it available system-wide.

# Undefine the broken _AC_LANG_IO_PROGRAM from autoconf/go.m4...

m4_ifdef([_AC_LANG_IO_PROGRAM(Go)], m4_undefine([_AC_LANG_IO_PROGRAM(Go)]))

# ...and redefine it to use a snippet of Go code that compiles properly.

[AC_LANG_PROGRAM([import ( "fmt"; "os" )],
[f, err := os.OpenFile("conftest.out", os.O_CREATE | os.O_WRONLY, 0777)
if err != nil {
if err = f.Close(); err != nil {

# Support macro to check that a program that uses LIB can be linked.


# This little "embedded" shell function outputs a list of dependencies for a
# specified library beyond the set of standard imports.

[AS_FUNCTION_DESCRIBE([ac_check_golib_nonstd_deps], [LINENO],
[Find the non-standard dependencies of the target library.])],
[for i in `$ac_check_golib_go list -f {{.Deps}} $[]1 | tr -d [[]]`; do
$ac_check_golib_go list -f {{.Standard}} $i | grep -q false
if [[ $? = 0 ]]; then
echo $i

ac_check_golib_paths=`$ac_check_golib_go list -compiler gccgo \
-f {{.Target}} $2`
ac_check_golib_seeds=`ac_check_golib_nonstd_deps $2`

# Compute the transitive closure of non-standard imports.

for ac_check_golib_seed in $ac_check_golib_seeds; do
ac_check_golib_oldseeds="$ac_check_golib_oldseeds $ac_check_golib_seed"
ac_check_golib_newseeds=`ac_check_golib_nonstd_deps $ac_check_golib_seed`

for ac_check_golib_newseed in $ac_check_golib_newseeds; do
if ! echo "$ac_check_golib_oldseeds" | grep -q "$ac_check_golib_newseed"
$ac_check_golib_oldseeds $ac_check_golib_newseed"

ac_check_golib_seeds="$ac_check_golib_seeds $ac_check_golib_newseeds"
ac_check_golib_path=`$ac_check_golib_go list -compiler gccgo \
-f {{.Target}} $ac_check_golib_seed`
ac_check_golib_paths="$ac_check_golib_paths $ac_check_golib_path"

LIBS="-Wl,-( $LIBS $ac_check_golib_paths -Wl,-)"

[$1[]_GOFLAGS="-I $ac_check_golib_root"
m4_default([$4], AC_MSG_ERROR([Go library ($2) not found.]))

# Attempts to locate a Go library LIB somewhere under $GOPATH that can be used
# to compile and link a program that uses it, optionally referencing SYMBOL.
# Calls ACTION-IF-FOUND if a usable library is found, in addition to setting
# VARIABLE_GOFLAGS and VARIABLE_LIBS to the requisite compiler and linker flags.

AC_ARG_VAR([$1][_GOFLAGS], [Go compiler flags for $2])
AC_ARG_VAR([$1][_LIBS], [linker flags for $2])

AC_MSG_CHECKING([for Go library $2])

if test -z "$ac_check_golib_go"; then

# The gccgo compiler needs the `pkg/gccgo_ARCH` part of the GOPATH entry that
# contains the target library, so use the `go' command to compute the full
# target install directory and then subtract out the library-specific suffix.
# E.g., /home/user/gocode/pkg/gccgo_linux_amd64/foo/bar/libbaz.a ->
# /home/user/gocode/pkg/gccgo_linux_amd64

ac_check_golib_root=`$ac_check_golib_go list -compiler gccgo \
-f {{.Target}} $2`
ac_check_golib_root=`dirname $ac_check_golib_root`
ac_check_golib_path=`dirname $2`


# Save the original GOFLAGS and add the computed root as an include path.

GOFLAGS="$GOFLAGS -I $ac_check_golib_root"

AS_IF([test -n "$3"],
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([import ("os"; "$2")],[
if $3 == nil {
} else {

# Did it compile? Then try to link it.

[_AC_LINK_GOLIB([$1], [$2], [$4], [$5])],

# Otherwise report an error.

m4_default([$5], AC_MSG_ERROR([Go library ($2) not found.]))])],

# If there was no SYMBOL argument provided to this macro, take that to mean
# this library needs to be imported but won't be referenced, so craft a test
# that exercises that kind of import clause (i.e., one with the `_'
# modifier).

[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([import ("os"; _ "$2")],
[_AC_LINK_GOLIB([$1], [$2], [$4], [$5])],
m4_default([$5], AC_MSG_ERROR([Go library ($2) not found.]))])])

# Restore the original GOFLAGS.


Food for Autoconf. Note the call to `AC_CONFIG_MACRO_DIR' to make golib.m4 visible.

AC_INIT([My Cool Go Program], [0.1], [], [myprog], [])



AC_CHECK_GOLIB([MARTINI], [], [martini.Classic])
AC_CHECK_GOLIB([GORM], [], [gorm.Open])
AC_CHECK_GOLIB([RENDER], [], [render.Renderer])


Food for Automake. From what I can tell, Automake has no explicit support for building Go programs, but it does include general support for defining build steps for arbitrary source languages. Note the use of the ".go.o" suffix declaration to specify the compilation steps for .go source files and the "LINK" variable definition to specify a custom link step. The `FOO_GOFLAGS' and `FOO_LIBS' variables are created by the expansion of `AC_CHECK_GOLIB([FOO]...)' in

bin_PROGRAMS = myprog

myprog_SOURCES = src/mypackage/myproject/myprog.go

myprog_DEPENDENCIES = builddir
myprog_LINK = $(GOC) $(GOFLAGS) -o bin/$(@F)

if [ ! -d bin ]; then mkdir bin; fi

$(GOC) -c $(myprog_GOFLAGS) -o $@ $<

CLEANFILES = bin/myprog

Here's a snippet from the output of configure when the integration is working:

checking for gccgo... gccgo
checking whether the Go compiler works... yes
checking for Go compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking for go... /usr/bin/go
checking for Go library yes
checking for Go library yes
checking for Go library yes
checking for Go library yes

To apply this code to your own project, copy / adapt the snippets above along with your own Go code into the following directory layout.

If you add this project root to your GOPATH, you should be able to run `go build package/project' in addition to `./configure && make'.

Problems? Let me know.
16 Feb 2015 (updated 16 Feb 2015 at 18:11 UTC) »
Transactional B-trees

The next version of gzochi is going to include a new storage engine implementation that keeps data entirely in memory while providing transactional guarantees around atomicity of operations and visibility of data. There are a couple of motivations for this feature. The primary reason is to prepare the architecture for multi-node operation, which, as per Blackman and Waldo's technical report on Project Darkstar, requires that the game server -- which becomes, in a distributed configuration, more like a task execution server -- maintain a transient cache of game data delivered from a remote, durable store of record. The other is to offer an easier mechanism for "quick-starting" a new gzochi installation, and to support users who, for political or operational reasons, prefer not to bundle or install any of the third-party databases that support the persistent storage engines.

That first motivation wouldn't bias me much in either direction on the build-vs-buy question; Project Darkstar's authors almost certainly (planned) to implement this using Berkeley DB's "private" mode (example here). However, gzochi is intentionally agnostic when it comes to storage technology. The database that underlies a storage engine implementation needs only to support serializably isolated cursors and reasonable guarantees around durability; requiring purely in-memory operation would be a heavy requirement. And I feel too ambivalent about Oracle to pin the architecture to what BDB supports, AGPL or no. (The Darkstar architects should have been a bit warier themselves.) So I settled on the "build" side of the balance. ...Although my first move was to look for some source code to steal. And I came up weirdly short. The following is a list of the interesting bits and dead ends I came across while searching for transaction B-tree implementations.

Some more specific requirements: There are two popular flavors of concurrency for the kind of data structure I wanted to build with the serializable level of transactional isolation I wanted to provide. Pessimistic locking requires that all access to the structural or data content of the tree by different agents (threads, transactions) be done under the protection of an explicit read or write lock, depending on the nature of the access. Optimistic locking often comes in the form of multi-version concurrency control, and offers each agent a mutable snapshot of the data over the lifetime of a transaction, mostly brokering resolutions to conflicts only at commit time. Each approach has its advantages: MVCC transactions never wait for locks, which usually makes them faster. Pessimistic locking implementations typically detect conflicting access patterns earlier than optimistic implementations, which wait until commit to do so. Because gzochi transactions are short and fine-grained, and the user experience is sensitive to latency, I believe that the time wasted by unnecessary execution of "doomed" transactional code is better avoided via pessimistic locking. (I could be wrong.)

Here's what I found:
  • Apache Mavibot - Transactional B-tree implementation in Java using MVCC. Supports persistent and in-memory storage. Hard for me to tell from reading their source code how their in-memory mode could possibly work for multi-operation transactions.
  • Masstree - Optimistically concurrent, in-memory non-transactional B+tree implementation designed to better exploit SMP hardware.
  • Silo - Optimistically concurrent, in-memory transactional store that uses Masstree as its B-tree implementation.
  • SQLite - Lightweight SQL implementation with in-memory options, with a transaction-capable B-tree as the underlying storage. Their source code is readable, but the model of threads, connections, transactions, and what they call "shared cache" is hard to puzzle out. The B-tree accumulates cruft without explicit vacuuming. The B-tree code is enormous.
  • eXtremeDB - Commercial in-memory database with lots of interesting properties (pessimistic and MVCC modes, claimed latencies extremely low) but, you know, no source code. So.
Because I was unable to find any pessimistic implementations with readily stealable source code, I struck out on my own. It took me about a week to build my own pessimistic B+tree, using Berkeley DB's code and architecture as a surprisingly helpful guide. My version is significantly slower than BDB (with persistence to disk enabled) but I'm going to tune it and tweak it and hopefully get it to holler if not scream.

gzochi 0.7 is out. Get it here.

This was a tough release to get out, in no small part because I'd decided to provide a suite of data manipulation tools to support the practical administration of a gzochid server. I wanted to make it easy to export and import data, for the purposes of backup and restore, and to perform large-scale schema transformations of serialized data, to support changes to data structures that occur as part of the evolution of a game application.

The first two were (relatively) easy. I looked for prior art and found some in the utilities that ship with Berkley DB, the most mature (and as of last year, the most appealingly licensed) of the storage engines the server supports, db_dump and db_load. It was pretty easy to make some higher-level ones that do the same thing (read and write database rows in a portable format) in terms of gzochid's abstract storage API: gzochi-dump and gzochi-load.

Migrating data is a different story. It's not enough to process the stream of rows as they're emitted from the object store. The objects that make up the state of a game form a graph, so you need to traverse that graph, marking nodes as they're visited. It also means that migrations must be done mostly in-place; fully transforming any single reachable node in the graph may require adding new nodes, removing existing ones, or changing the structure of other parts of the graph. Furthermore, the nature of the transformation to be done complicates the process: Each row of data is effectively a plain byte array, the only metadata being a prefix that provides a key into the application's registry of types; the rest of the data has no structure outside of what the application's serialization code applies to it. To transform the data but preserve its "type," two different serializers must be used -- one to read the original data in, the other to write the transformed data back out. This is indeed a problem RedDwarf Server faced because of its reliance on Java's built-in object serialization mechanism. Some rather pointed discussion of the problem can be found in this forum thread. Someone on that same thread mentions ICE and its sub-project, Freeze, which apparently solves a similar problem.

I did a little reading on these technologies, and while they didn't turn out to be -- nor did I expect them to be -- a fit for my needs, they got me thinking about how migrating a data set like this one is a complex enough operation that it might need some first-class modeling, beyond just defining the logic of the transformation. The solution I wound up with involves XML-based descriptions of input and output (read: deserialization and serialization) configurations and provisioning real storage for the state of the migration itself. Doesn't feel like too much; hope it's enough.

85 older entries...


Others have certified joolean as follows:

  • lerdsuwa certified joolean as Apprentice
  • badvogato certified joolean as Journeyer
  • mako certified joolean as Apprentice
  • aicra certified joolean as Apprentice
  • lkcl certified joolean as Apprentice
  • ara0bswft16 certified joolean as Apprentice
  • dangermaus certified joolean as Master

[ Certification disabled because you're not logged in. ]

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!

Share this page