How Hard Could That Be?

If you're about to use a database cluster to scale the performance of your application, you need some way to balance the load between the nodes. Ideally, the application should be smart enough to distribute its connections between different hosts and to fail over. But in practice this is rarely the case (and it is rarely practicable). Normally what you want is a virtual server that does the load balancing on behalf of the application. Not only does this simplify the application and its setup, it also allows for a really smart load balancer.

Here comes LVS. But LVS has some annoying limitations:

  • You need a separate box with two NICs to make use of it. There is a so-called "local" mode, but you still can't run it on the client machine. This is a real showstopper when you have limited resources.
  • Its management is coarse and, as far as I could figure out, it does not support "draining" of nodes.
  • Setting it up is not for the faint of heart and requires tremendous expertise in Linux networking. (I failed to get more than a couple hundred packets per second out of it, with not a single error or retransmission in the TCP stream. It worked correctly, just slowly. Apparently some of the hundreds of kernel options made it route packets with a delay.)

Basically, LVS is not something that "just works", and you never, ever want to tell your customer "you just have to set up LVS", unless you want to lose him.

MySQL Proxy is just a CPU hog. Besides, it works only with MySQL. Perhaps it has its uses, but not as an efficient load balancer. We discarded it quickly.

Contrast this with Pen, an easy-to-use userspace TCP load balancer. It JUST WORKS. It still has no concept of node "drain", but at least you don't have to think for a second to get it running. So it became our TCP router of choice during development.

However, during benchmarking we noticed that it actually consumes more CPU than the client application(!), and the client machine usually got saturated when benchmarking a cluster of more than two nodes. There went our benchmarking: the load balancer had become the bottleneck. I must say I was quite confused. How come a load balancer, whose job is just to read and write packets, consumes more CPU than the client application (regardless of the actual application, mind you)? Perhaps Pen is way too feature-rich, I thought. It must be doing something unnecessary; that must be the reason...

How hard could it be to write a userspace TCP load balancer that just reads from one socket and writes to another?

So I set out to do it. The target was to write a balancer that would:

  • be as simple to use as Pen.
  • support node "drain".
  • scale on SMP machines.

And of course it had to be as CPU-efficient as possible. A few days of coding and here comes glbd (Galera Load Balancing Daemon, for lack of imagination). All new and shiny. But wait a minute! What's that? It uses as much CPU as Pen? But how could that be??? I made it myself! It does nothing but read and write! The actual loop is only a few lines of code! This can't be happening!
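
For the curious, here is a rough sketch of the kind of per-connection copy loop I am talking about (a simplified illustration, not glbd's actual source; the forward() name is mine, and error handling and partial writes are glossed over):

    #include <sys/types.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    /* Shuttle data between a client socket and a server socket until
     * either side closes the connection. */
    static void forward (int client, int server)
    {
        char buf[4096];
        int  nfds = (client > server ? client : server) + 1;

        for (;;) {
            fd_set fds;
            FD_ZERO (&fds);
            FD_SET  (client, &fds);
            FD_SET  (server, &fds);

            /* one select() per chunk of data... */
            if (select (nfds, &fds, NULL, NULL, NULL) <= 0) break;

            /* ...plus one recv() and one send() for each direction that fired */
            if (FD_ISSET (client, &fds)) {
                ssize_t n = recv (client, buf, sizeof (buf), 0);
                if (n <= 0 || send (server, buf, n, 0) != n) break;
            }
            if (FD_ISSET (server, &fds)) {
                ssize_t n = recv (server, buf, sizeof (buf), 0);
                if (n <= 0 || send (client, buf, n, 0) != n) break;
            }
        }
    }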

Well, actually, according to top, it IS more efficient than Pen, by some 8%. And thanks to being able to run several threads, even on a single-core CPU it provides slightly lower latency than Pen (the select()s fire a little bit faster). But still, what the hell is going on? I added some statistics logging...

And from the looks of it, efficient TCP load balancing for MySQL (and probably for other SQL servers) is impossible. The whole point of a database is to do as much work as possible on site and to minimize the amount of data sent back and forth, so SQL packets are usually quite small: for sqlgen (our synthetic load generator) the average packet size is 116 bytes, and for DBT2 it is about 220 bytes. It is nice that these packets are so small. The problem is that the operating system, as it is, is ill-suited for routing many packets through userspace. Basically, each packet requires a separate select() call: the average number of file descriptors set per select() is 1.2. So each SQL command requires 1 select(), 1 recv() and 1 send(). If you want 10000 SELECTs per second, get ready to handle 20000 select()s, 20000 recv()s and 20000 send()s (command + reply), a total of 60000 system calls per second.

At this rate even splice() for TCP, which appeared only in kernel 2.6.25, would not help much: there is a fundamental overhead of 3 system calls per packet. Mind you, if you want to stream some video it is not a problem, because the packet rate is much lower. But in database applications it has to be that high. And I must take my words back and acknowledge that the Pen developers did a really great job.
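
To illustrate, here is roughly what one forwarding direction looks like with splice() (again just a sketch under my assumptions, splice_once is a name I made up, not code from any real balancer): the payload never enters userspace, yet each chunk of data still costs three system calls.

    #define _GNU_SOURCE         /* for splice() */
    #include <fcntl.h>
    #include <sys/select.h>
    #include <sys/types.h>

    /* Forward whatever has arrived on 'src' to 'dst' through a pipe created
     * earlier with pipe(pipefd). Returns 0 on success, -1 on error/EOF. */
    static int splice_once (int src, int dst, int pipefd[2])
    {
        fd_set fds;
        FD_ZERO (&fds);
        FD_SET  (src, &fds);

        /* syscall #1: find out that src is readable */
        if (select (src + 1, &fds, NULL, NULL, NULL) <= 0) return -1;

        /* syscall #2: socket -> pipe, the payload stays in the kernel */
        ssize_t n = splice (src, NULL, pipefd[1], NULL, 65536,
                            SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
        if (n <= 0) return -1;

        /* syscall #3: pipe -> socket */
        return (splice (pipefd[0], NULL, dst, NULL, n, SPLICE_F_MOVE) == n)
               ? 0 : -1;
    }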

Now, how hard could it be to make a system call that just connects two sockets and uses the input buffer of one as the output buffer of the other? Why the splice() half-measures? Why do I have to call select() (or any of its brothers) and then iterate over all my file descriptors to check which ones were set at all? I just want to pipe two sockets together and forget about them! Splice my ass, what were the authors of splice() thinking? Efficiently copy data without getting it to userspace? Well, if the data (plural, right?) don't get to userspace, what is the purpose of select() and the two subsequent splice() calls? Obviously, if an application uses splice() it does not care what's in the data, and as such it does not care when it arrives. So the sequence select(), splice(), splice() is totally redundant. The kernel could do the job without notifying the application at all. But it seems the idea is: we splice() because we can!