Jepsen is an effort to improve the safety of distributed databases, queues, consensus systems, etc. We maintain an open source software library for systems testing, as well as blog posts and conference talks exploring particular systems' failure modes. In each analysis we explore whether the system lives up to its documentation's claims, file new bugs, and offer recommendations for operators.
Jepsen pushes vendors to make accurate claims and test their software rigorously, helps users choose databases and queues that fit their needs, and teaches engineers how to evaluate distributed systems correctness for themselves.
Together with Aerospike, we validated their next-generation consensus system, confirming two known data-loss scenarios due to process pauses and crashes, and discovering a previously unknown bug in their internal RPC proxy mechanism which allowed clients to see successfully applied updates as definite failures. Aerospike fixed this bug, added an option to require that nodes write to disk before acknowledging operations to clients, and plans to extend the maximum clock skew their consensus system can tolerate.
Jepsen demonstrated numerous problems with data loss in Hazelcast, an in-memory data grid: map updates could be lost, atomic references were not atomic, ID generators generated duplicate IDs, locks were not exclusive, and queues could lose acknowledged messages.
Jepsen worked with Tendermint to evaluate their distributed, linearizable, Byzantine-fault-tolerant blockchain system. We were unable to find issues with their replication algorithm, but did discover single-node crashes and issues with crash recovery that could lead to unavailability or data loss.
We worked with Cockroach Labs to refine the Jepsen test suite they wrote for CockroachDB, and found multiple bugs leading to serializability violations, all of which are now fixed.
Jepsen helped MongoDB identify design flaws in their v0 replication protocol and implementation bugs in its v1 replacement, all of which could lead to the loss of majority-acknowledged operations. We also collaborated with MongoDB to integrate Jepsen into their CI system. MongoDB added support for linearizable reads in October 2016.
Our research for Crate.io led to the discovery of dirty reads, replica divergence, and lost updates in Elasticsearch.
Jepsen found that document versions in Crate.io do not uniquely identify a particular version of a document, allowing lost updates.
We worked with VoltDB to discover and fix stale and dirty reads in their SQL database, and, in uncommon configurations, two bugs leading to the loss of acknowledged updates.