Clarifications on the CAP Theorem and Data-Related Errors
Scattered in the larger conversation is a continued misperception of my position regarding the CAP theorem. Coda writes "Michael Stonebraker's assertion aside, partitions (read: failures) do happen." Others have made similar comments, so let me set the record straight.
I have consistently and repeatedly attempted to make just four points, which I elaborate in this post. The most important point is that using the CAP theorem to justify giving up ACID (consistency) is flawed. In the real world, giving up consistency does not improve availability. Hence, you are giving up consistency in exchange for nothing. This is a horrible engineering tradeoff, and the CAP theorem, therefore, encourages engineers to make awful decisions.
Point 1: The CAP theorem contains an idealized and incomplete model of data-related errors.
I have two main issues with the CAP theorem formulation of errors:
- The model does not deal with several important classes of errors, which a real world system administrator must cope with.
- The model suggests erroneous real-world behavior for at least one important failure mode.
As such, looking to the CAP theorem for real world guidance is highly suspect. Let me explain. The following important sources of outages are not considered in the CAP theorem.
Bohrbugs. These are repeatable DBMS errors that cause the DBMS to crash. In other words, even when multiple database replicas are available, the same transaction issued to the replicas will cause all of them to crash. No matter what, the world stops, and high availability is an impossible goal.
Application errors. The application inadvertently performs incorrect updates to (all copies of) the database. The database is now corrupted, and any sane DBA will stop the world and put the database back into a consistent state. Again, high availability is impossible to achieve.
Human error. A human types the database equivalent of RM * and causes a global outage. There is no possibility of continuing operation.
Reprovisioning. Many current DBMSs (but not all) require down time to reprovision the hardware to provide more (or less) capacity. Again, this is typically a “stop the world” operation.
These are examples of unmaskable outages. They will cause any distributed system to be unavailable. In CAP theorem terms, you simply cannot have availability when issues like the above are present. Discussion about supporting any two of Consistency, Availability and Partition tolerance is irrelevant.
Now let us turn to single node failures. These are considered network partitions by Coda. In Coda's view, the dead node is in one partition and the remaining N-1 nodes are in the other one. The guidance from the CAP theorem is that you must choose either A or C when a network partition is present. As is obvious in the real world, it is possible to achieve both C and A in this failure mode: you simply fail over to a replica in a transactionally consistent way. Notably, at least Tandem and Vertica have been doing exactly this for years. Therefore, considering a node failure as a partition results in an obviously inappropriate CAP theorem conclusion.
A similar statement can be made about network partitions in which a single node is split off from the remaining N-1 nodes by a network failure.
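A minimal sketch of the failover path just described, assuming synchronous replication between two nodes; the node names and two-replica setup are illustrative assumptions, not any particular product's design:

```python
# Sketch of transactionally consistent failover with synchronous replication.
# The node names and two-replica setup are illustrative assumptions, not any
# particular product's implementation.

class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}            # committed key -> value

class ReplicatedStore:
    def __init__(self, nodes):
        self.nodes = nodes

    def commit(self, key, value):
        live = [n for n in self.nodes if n.alive]
        if not live:
            raise RuntimeError("total outage: no live replicas")
        # Synchronous replication: the write is applied to every live replica
        # before the commit is acknowledged, so replicas never diverge.
        for n in live:
            n.data[key] = value
        return "committed"

    def read(self, key):
        # Failover: any live replica can serve the request, and it already
        # holds every acknowledged commit, so the answer stays consistent.
        for n in self.nodes:
            if n.alive:
                return n.data.get(key)
        raise RuntimeError("total outage: no live replicas")

store = ReplicatedStore([Node("primary"), Node("replica")])
store.commit("balance", 100)
store.nodes[0].alive = False      # single-node failure
print(store.read("balance"))      # -> 100: both consistent and available
```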
To summarize my first point, the CAP theorem assumes reliable applications, reliable DBMSs, reliable humans and no change in resources. Unfortunately, these are all issues that must be considered. In addition, modeling node failures as network partitions leads to an obviously bad conclusion.
Point 2: Engineering tradeoffs abound in distributed systems.
When all the errors that can occur are considered, the real question becomes "What errors can I mask, and at what cost?" This is an engineering tradeoff of run-time cost versus probability of outage. Coda starts to get at this issue at the end of his post. To me, the answer to this question crucially depends on the frequency of the various kinds of errors. Unfortunately, this is very sensitive to the actual operating environment. For example, IBM's MVS is considerably more reliable than Linux or Windows. It also depends on how hard it is to mask errors. Specifically, Byzantine failures are way more difficult to deal with than "stop cleanly" errors.
Thus, architecting a distributed system is a complex set of engineering tradeoffs, often not amenable to hard and fast statements.
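As a toy illustration of the run-time-cost-versus-outage-probability tradeoff discussed above, here is one way to frame the comparison; every rate, recovery time, and dollar figure below is invented purely for illustration:

```python
# A toy version of the "what can I mask, and at what cost?" tradeoff: pick the
# masking strategy that minimizes (run-time overhead cost) + (expected outage
# cost). All numbers below are made up for illustration, not measurements.

strategies = {
    # name: (extra run-time cost per year, expected outage hours per year)
    "no masking":                 (      0, 12.0),
    "sync replica failover":      ( 20_000,  6.0),
    "replica + partition quorum": ( 60_000,  5.5),
}
COST_PER_OUTAGE_HOUR = 10_000   # assumed business cost of an hour of downtime

def total_cost(runtime_cost, outage_hours):
    return runtime_cost + outage_hours * COST_PER_OUTAGE_HOUR

for name, (runtime_cost, outage_hours) in strategies.items():
    print(f"{name:28s} total annual cost: {total_cost(runtime_cost, outage_hours):>9,.0f}")
```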
Point 3: The CAP theorem is often used inappropriately to justify engineering decisions.
This is my most important complaint. Many NoSQL systems give up on transactions (ACID), using the CAP theorem as a justification. Specifically, they argue that partitions happen, concluding that you need to sacrifice either availability or consistency. They choose to sacrifice consistency. I believe this is a very poor engineering tradeoff.
The reason is very simple. In my experience, network partitions do not happen often. Specifically, they occur less frequently than the sum of Bohrbugs, application errors, human errors and reprovisioning events. So it doesn't much matter what you do when confronted with network partitions. Surviving them will not "move the needle" on availability, because higher-frequency events will cause global outages. Hence, you are giving up something (consistency) and getting nothing in return. This point is further explored in my "urban myths" presentation [4].
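A quick back-of-the-envelope illustration of the "move the needle" claim, using invented downtime figures:

```python
# Illustration of the "won't move the needle" claim, with invented downtime
# figures: if unmaskable outages dominate, avoiding partition downtime barely
# changes overall availability.

HOURS_PER_YEAR = 24 * 365

unmaskable_hours = 10.0   # assumed: Bohrbugs + app errors + human errors + reprovisioning
partition_hours = 0.5     # assumed downtime caused by network partitions

def availability(downtime_hours):
    return 1 - downtime_hours / HOURS_PER_YEAR

cp = availability(unmaskable_hours + partition_hours)  # stay consistent, pause during partitions
ap = availability(unmaskable_hours)                    # give up consistency to ride them out
print(f"CP: {cp:.4%}  AP: {ap:.4%}  difference: {ap - cp:.4%}")
```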
In effect, this reinforces the undesirability of making engineering decisions based on the CAP theorem.
Point 4: Node speed fundamentally alters the engineering discussion.
Some people argue that next generation systems are going to run on thousands of nodes. In theory, the probability of failures of all kinds will increase. The industry’s fascination with technologies like MapReduce, a fine solution for massively parallel batch operations, seems to fuel the notion that all “big data” solutions must necessarily be deployed on large networks of unreliable servers. For transactional systems this is a poor engineering premise.
Next generation DBMS technologies, such as VoltDB, have been shown to run at around 50X the speed of conventional SQL engines. Thus, if you need 200 nodes to support a specific SQL application, then VoltDB can probably do the same application on four nodes. The probability of a failure on 200 nodes is wildly different from the probability of a failure on four nodes.
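As a rough sketch of how failure probability scales with node count; the per-node failure rate below is an assumption chosen only for illustration:

```python
# Rough illustration of why node count matters for failure probability.
# The per-node daily failure probability is an assumption, not a measurement.

p_node_fails = 0.001          # assumed chance a given node fails on a given day

def p_any_failure(n_nodes):
    """Probability that at least one of n independent nodes fails."""
    return 1 - (1 - p_node_fails) ** n_nodes

print(f"200 nodes: {p_any_failure(200):.1%} chance of some failure per day")
print(f"  4 nodes: {p_any_failure(4):.2%} chance of some failure per day")
```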
The point here is that the speed of local execution definitely matters, and “brute force” on large numbers of nodes is probably a generally bad idea, because it makes failures way more frequent, and therefore more challenging to deal with.
In summary, appealing to the CAP theorem exclusively for engineering guidance is, in my opinion, inappropriate. CAP contains an incomplete error model that can easily skew one’s thinking about what is ultimately a complex collection of engineering tradeoffs. These tradeoffs must also fundamentally consider the speed of your local software and the hardware, the software environment in which you are running and the business criticality of data consistency.
Mike Stonebraker
Founder
VoltDB

This comment, 'Specifically, Byzantine failures are way more difficult to deal with than "stop cleanly" errors,' reminded me of a lot of work that has been coming out of the IEEE regarding automation paradoxes. If we build more complex automation, we will see more complex failures. The system may fail less often, but when it does, recovery also becomes more complex. This has significant business costs where database systems are at issue (and where transportation is at issue, human lives can be lost), costs which are often unacknowledged. Oftentimes there is a push for more automation rather than less, but at some point there is a question of diminishing returns.
I personally wonder whether there is a maximum complexity that is supportable….
1. The dispute is not about the CAP theorem itself, but about its interpretation. Most people don't understand the formalization by Gilbert/Lynch.
2. How do you prove the "functional correctness" of your distributed system with the CAP theorem?
3. If theory and everyday problems are so different, why then is it important for every distributed systems engineer?
4. What is speed?
…..
I am tired…
One wrinkle that I don’t hear much discussion of is the value of ACID, which can vary significantly based on the use case.
For Amazon's business model, "eventual consistency" is probably fine: the worst thing that typically happens is that you fill an order for an item you don't actually have in stock, so you end up having to cancel the order and give some kind of make-good to the customer who was inconvenienced.
Now what if you’re a stock exchange and you let customer A buy 100 shares of stock at $1 from customer B, but by the time you get around to completing the transaction the stock price is now $1.10, so somebody is out $10. Unlike with the Amazon example, you’re not going to be allowed to cancel customer A’s order, so you’re going to have to make up the difference.
Do that a few times, especially with larger orders, and you can lose a lot of money.
I think that maybe a rewording of CAP would suffice. Given CAP, you can have at most two of the three properties in any given solution.
“….This is a horrible engineering tradeoff, and the CAP theorem, therefore, encourages engineers to make awful decisions…..”
I don’t agree with this post.
Brewer's CAP theorem isn't "engineering guidance"; it's a mathematical proof.
Brewer's CAP theorem has been proven, and it's very important for every distributed application engineer.
The CAP theorem is useful for proving the "functional correctness" of distributed systems, not for calculating the mistakes of a DBA or counting the number of bugs.
Theory and everyday problems are largely unrelated topics.
This is why the speed of your local software, or the number of nodes, doesn't matter when thinking about the CAP theorem.
The speed of a single node is about the cost of the node (which grows exponentially, not linearly), and the number of nodes is about capacity (the total amount of CPU, memory, or disk you want).
Your assumption that "a low number of nodes is good" holds because network latency matters in most cases, but if you want a distributed system, it's because you need something that a monolithic piece of software can't do.
Jon,
I would certainly be skeptical of anyone claiming to do cross-WAN distributed transactions with a latency SLA.
Having listened to architects solving similar problems, my conclusion is that this problem is so heavily impacted by cost, project-level requirements (DR only? Master-master?), bandwidth, and the source and size of incoming data feeds (for example, in market applications) that it necessarily requires each organization to make explicit trade-offs.
The inputs to that decision are many – including WAN reliability.
Thanks for the thoughtful comment,
Ryan.
J. Chris,
Interesting use case. Certainly caching/local storage is key in cases like this. I can see the promise of MVCC approaches. I’m curious to see more benchmarks describing performance of write-intensive distributed, replicated MVCC systems.
CouchDB is great – and seems to be appearing in lots of interesting use cases.
Best wishes,
Ryan.
This is a very insightful post. Your point 3 was new to me. I want to add to point 4 that Moore's law still continues, so the available single-node capacity will be enough for more and more apps. It is not true that most apps nowadays are web-scale. The perception of reality on this point has been a little distorted by Facebook, Twitter and the like. The percentage of apps that will only ever need a single node will converge toward 100%.
> network partitions do not happen often
If you define your “system” as the database nodes on the LAN, then you are correct. But CAP is talking about the whole system, not your 4 db nodes on the LAN.
Any web app (or company with multiple buildings) makes a CAP choice every time their WAN has a hiccup. Just because the LAN doesn’t partition doesn’t mean you have solved the problem.
Availability in the theorem requires that all live nodes, including the node that is partitioned off and can’t talk to the others, be available to clients. Vertica & Tandem meet the colloquial, eminently sensible definition of available, but not the CAP theorem’s definition.
Network partitions may well be rare within a single data center, but they are anything but rare in a geographically-distributed, multiple-data center setting. BGP route flaps, general TCP congestion, and a whole host of other things can cause latency to vary wildly across a WAN; for an application that must meet aggressive latency SLAs, it is not uncommon for a cross-data center write to exceed the latency budget, and this happens *many* times per day. Per the CAP theorem paper, a dropped message is indistinguishable from a very-delayed one.
Long-haul asynchronous log-shipping replication is a common solution here, which makes the whole system (even built on RDBMS technologies) an AP (eventually-consistent) system. There are certainly cases where this is not an appropriate engineering tradeoff, as Mr. Stonebraker states, in which case one must employ distributed transactions to ensure multi-data center consistency; perhaps an application can wait arbitrarily long for a WAN network delay to be resolved before returning a response, but there are plenty of cases where that is not an acceptable business outcome (for example, the response times of a website may have a major impact on profitability). To write these cases off oversimplifies the situation.
(And, by the way, I am an architect at a large enterprise).
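A minimal sketch of the long-haul asynchronous log-shipping pattern described above; the data-center names and in-memory structures are illustrative assumptions, not any vendor's design:

```python
# Minimal sketch of asynchronous log-shipping between two data centers.
# Names and structure are illustrative assumptions, not any vendor's design.

class DataCenter:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.log = []           # ordered list of (key, value) changes

primary, remote = DataCenter("us-east"), DataCenter("eu-west")
wan_up = True                   # toggled off to simulate a partition or long delay

def commit(key, value):
    # The write commits locally and returns immediately; shipping is async,
    # so WAN latency never sits on the client's critical path.
    primary.data[key] = value
    primary.log.append((key, value))
    return "committed locally"

def ship_log():
    # Background task: apply any unshipped log entries to the remote replica.
    if not wan_up:
        return                  # partition: remote falls behind, primary stays available
    for key, value in primary.log[len(remote.log):]:
        remote.data[key] = value
        remote.log.append((key, value))

commit("order:42", "paid")
ship_log()                      # replica catches up while the WAN is healthy
wan_up = False                  # WAN hiccup: commits still succeed locally...
commit("order:43", "refunded")
ship_log()
print(remote.data)              # ...but the replica is stale (eventual consistency)
wan_up = True
ship_log()
print(remote.data)              # replica converges once the WAN recovers
```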
Let me address these points one at a time.
1) Bugs exist, whether in the DB, in the app layer, or in the mind of the human who types rm -rf *. Let's propose BCAP, then: Bugs, Consistency, Availability, Partition tolerance. You still only get to choose two, unfortunately, and Partition tolerance still can't be one of them. As for reprovisioning, most NoSQL options I'm aware of allow for this to occur online.
2) Couldn't agree more, and A vs. C is just one tradeoff on a long list. Moreover, as Brewer points out (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.411), the better question is how to manage yield and harvest during a partition. Coda points out options such as reads but no writes, eventual consistency, or operations on only part of the data.
3) CAP is only used as a reason to drop ACID when AP is chosen. VoltDB is a prime example of a CP system that provides ACID. In fact, I don't see why any CP system couldn't in theory provide ACID. That said, it's often not the first thing that these projects strive for.
4) It's an interesting point: if you can greatly reduce the chance of P happening, in this case by reducing the number of nodes necessary, then that's great, but the CAP decision still remains.
There is an extreme case of partition tolerance that must be considered: disconnected operation.
For users at the edge of the network, latency can be the biggest performance killer. If it takes a second or more for each user action to be reflected in application state due to round-trip time (as on the mobile web), those seconds add up and users get frustrated.
However, if you move the database and web application to the mobile device itself, users no longer see network latency as part of the user experience critical path. Latency has been proven to be correlated directly to revenue, because users engage much more readily with snappy interfaces.
Once data is being operated on by the user on the local device, the key becomes synchronization. Asynchronous multi-master replication demands a different approach to consistency than the traditional model, which assumes the database is run by a central service.
The MVCC document model is designed for synchronization. It’s a different set of constraints than the relational model, but since it’s such a highly constrained problem space it also admits of general solutions and protocols.
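As a toy sketch of revision-tracked synchronization in the spirit of an MVCC document model; the revision scheme here is a deliberate simplification for illustration, not CouchDB's actual replication protocol:

```python
# Toy sketch of revision-tracked document sync between a device and a server.
# The revision scheme is a deliberate simplification, not CouchDB's protocol.

import hashlib

def new_rev(parent_rev, body):
    gen = int(parent_rev.split("-")[0]) + 1 if parent_rev else 1
    digest = hashlib.md5(repr((parent_rev, body)).encode()).hexdigest()[:8]
    return f"{gen}-{digest}"

class DocStore:
    def __init__(self):
        self.docs = {}          # doc_id -> list of (rev, parent_rev, body)

    def put(self, doc_id, body, parent_rev=None):
        rev = new_rev(parent_rev, body)
        self.docs.setdefault(doc_id, []).append((rev, parent_rev, body))
        return rev

    def sync_from(self, other, doc_id):
        # Pull any revisions we have not seen; keeping every branch means
        # concurrent edits surface as conflicts instead of silently losing data.
        known = {rev for rev, _, _ in self.docs.get(doc_id, [])}
        for entry in other.docs.get(doc_id, []):
            if entry[0] not in known:
                self.docs.setdefault(doc_id, []).append(entry)

    def conflicts(self, doc_id):
        # Leaf revisions are those no other revision lists as its parent.
        entries = self.docs.get(doc_id, [])
        parents = {parent for _, parent, _ in entries if parent}
        return [rev for rev, _, _ in entries if rev not in parents]

device, server = DocStore(), DocStore()
r1 = device.put("todo:1", {"text": "buy milk"})
server.sync_from(device, "todo:1")
device.put("todo:1", {"text": "buy milk", "done": True}, parent_rev=r1)  # offline edit
server.put("todo:1", {"text": "buy oat milk"}, parent_rev=r1)            # concurrent edit
server.sync_from(device, "todo:1")
print(server.conflicts("todo:1"))  # two leaf revisions -> the application resolves them
```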
It’s my belief that the MVCC document model is closer to the 80% solution for a large class of applications. Storing strongly typed and normalized representations of data is an artifact of our historically constrained computing resources, so it will always be a good way to optimize certain problems.
But for many human-scale data needs, schemaless documents are a very good fit. They optimize for the user, not the computer.