16 November 2010 at 11:33 New backend for the ocamlmq STOMP MQ

Shortly after publishing the ocamlmq STOMP message broker, I wrote a sqlite3 backend, which I have since improved to the point that it is faster and more predictable in its performance than the PostgreSQL one. You can find ocamlmq 0.5.0a here (git repos here). It retains all the scalability characteristics of the PostgreSQL backend while improving on its performance, making ocamlmq an attractive solution if you need a fast, small-footprint STOMP broker that can scale to millions of queues/messages/subscriptions, supports message priorities and timeouts (useful for queueing jobs), and provides broadcast semantics (topic subscriptions) with prefix matching.

ocamlmq's current 1500 lines of code give a lot of bang for the buck: it can queue and deliver messages synchronously about as fast as (or, on my box, a bit faster than) beanstalkd, accept messages asynchronously as quickly as RabbitMQ, and broadcast topic messages faster than RabbitMQ does. Also, since it takes 3 MB of memory at startup, you can cache many messages in memory before reaching the >140 MB ActiveMQ needs just to boot :)

Characteristics

The new message store takes a full 350 LoC (vs. <150 for the PostgreSQL one), but they are well spent.

Messages are cached in memory and flushed to the permanent sqlite3 store (with full fsync) when either:

  • more than a given number of messages have been received since the last flush or

  • it's been more than DT seconds since the last flush

Moreover, messages kept in memory can also be written to a binary log as they are received, before they are flushed (and fsync'ed) to the permanent store. This way, in-memory (not yet flushed) messages can be recovered if ocamlmq is killed or crashes. You can additionally ensure that the binlog is fsync'ed after each write. This gives you the strongest durability guarantee, for every single message is fsync'ed before the broker acknowledges its reception.
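
Here is a minimal OCaml sketch of that flush policy; all names and types are hypothetical, for illustration only, and ocamlmq's actual code differs:

      (* Buffer messages in memory; hard-flush to sqlite3 when either
         threshold is exceeded. Uses Unix.gettimeofday (unix library). *)
      type 'msg store = {
        mutable pending : 'msg list;      (* received, not yet in sqlite3 *)
        mutable npending : int;
        mutable last_flush : float;       (* time of the last hard flush *)
        max_pending : int;                (* flush after this many msgs... *)
        max_delay : float;                (* ...or after this many seconds *)
        flush_to_db : 'msg list -> unit;  (* INSERT + fsync into sqlite3 *)
        binlog : ('msg -> unit) option;   (* optional binary log append *)
      }

      let flush t =
        t.flush_to_db (List.rev t.pending);
        t.pending <- []; t.npending <- 0;
        t.last_flush <- Unix.gettimeofday ()

      let save t msg =
        (match t.binlog with Some log -> log msg | None -> ());
        t.pending <- msg :: t.pending;
        t.npending <- t.npending + 1;
        if t.npending >= t.max_pending
           || Unix.gettimeofday () -. t.last_flush > t.max_delay
        then flush t

(In the real broker the time-based flush would also be driven by a timer, so that a quiet broker still flushes; in this sketch it only fires on the next save.)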

The PostgreSQL backend used to slow down when there were lots of delivered but non-ACKed messages, and the disconnection of a client with many non-ACKed messages caused CPU spikes, as all the delivered messages were requeued at once. Neither is a problem in the improved sqlite3 backend.

Performance

Here follow some figures, taken on an oldish dual core AMD Athlon 64 X2 with a 7200 RPM SATA disk. CPU usage at the maximum transfer rates lies between 25 and 50% of one core.

Queueing (asynchronously)

Messages can be queued asynchronously (i.e., either with no receipt or with a receipt sent before the message is saved) at a rate exceeding 60000 msg/s.
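
On the wire, the difference between this mode and the synchronous one below is just the standard STOMP receipt header: in synchronous mode the broker sends the matching RECEIPT frame only after the message has been saved, while in asynchronous mode there is either no receipt or it is sent before the save. For illustration (^@ stands for the frame-terminating NUL byte; destinations and ids are made up):

      Asynchronous (no receipt requested):

        SEND
        destination:/queue/jobs

        payload^@

      Synchronous (the client waits for the matching RECEIPT):

        SEND
        destination:/queue/jobs
        receipt:42

        payload^@

        RECEIPT
        receipt-id:42

        ^@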

Queueing (synchronously, with receipt)

maxmsgs is the number of messages kept in memory before a hard flush.

maxmsgs        throughput       throughput (-binlog)
----------------------------------------------------
  1000         >11000 msg/s      >9000 msg/s
  5000         >12000 msg/s      >9000 msg/s
"infty"        >17000 msg/s     >14000 msg/s

If the binlog is enabled and -sync-binlog is in use, the maximum throughput is around 4000 msg/s.
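
An invocation with both options would look something along these lines (the exact option syntax and the binlog path are assumptions on my part; the server's usage message gives the real one):

      ./ocamlmq -binlog /var/spool/ocamlmq.binlog -sync-binlog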

Delivering

Topic messages:             >40000 msg/s  (CPU usage: ~60%)
Queue messages in memory:   >12000 msg/s
Queue messages on disk:      >2000 msg/s  (~2500 msg/s with sqlite3's WAL mode)
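
(WAL is sqlite3's write-ahead-log journal mode, available since sqlite3 3.7.0; it is enabled per database with a single pragma:)

      PRAGMA journal_mode=WAL;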

14 June 2010 at 11:52 ocamlmq, a small footprint, no-fuss STOMP broker for task queues and IPC

ocamlmq is a STOMP message broker with features that make it especially suitable for implementing task queues and communication between subsystems:

  • persistent queues, scaling over arbitrarily large numbers of queues with constant memory usage (i.e. supports millions of queues)

  • strong durability guarantees: messages are guaranteed to have been saved to disk by the time the sender gets a message receipt

  • message priorities

  • per-subscription prefetch limit for queue messages

  • error handling and ACK timeout: if a subscriber doesn't ACK a message after a (message-specific) timeout, the message will be sent to another subscriber. Messages are also resent automatically if a subscriber dies and its connection expires.

  • topic subscriptions (messages broadcast to all subscribers) with prefix matching

  • support for high numbers of subscriptions: millions of subscriptions pose no problem

  • simple extensions within the STOMP protocol to report the number of messages in a queue and the number of subscribers to a queue or topic
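
To make the queue semantics concrete, here is what a minimal exchange might look like, using standard STOMP 1.0 frames (^@ is the frame-terminating NUL byte; the destination is made up, and the exact headers ocamlmq uses for priorities, timeouts, prefetch and the counting extensions are described in its README, not shown here):

      SUBSCRIBE
      destination:/queue/tasks
      ack:client

      ^@

      MESSAGE
      destination:/queue/tasks
      message-id:id-42

      do something^@

      ACK
      message-id:id-42

      ^@

If the subscriber never sends the ACK before the message's timeout, or dies and its connection expires, the broker resends the message to another subscriber, as described above.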

I wrote ocamlmq because I could not get ActiveMQ or RabbitMQ (+ its STOMP adapter) to work the way I wanted. The problems I ran into include:

  • excessive memory footprint (ActiveMQ and RabbitMQ), both at startup (e.g. >120 MB for ActiveMQ vs. <3 MB for ocamlmq) and as new queues are created or new topic subscriptions are added

  • bad performance with ActiveMQ's scalable (to thousands of queues) storage backends (KahaDB, JDBC)

  • bad performance in RabbitMQ's topic message dispatch: RabbitMQ was doing a linear scan of the subscription table per dispatch

  • RabbitMQ did not guarantee that persistent messages had been saved to disk before sending the message receipt, which could lead to data loss (clarification 2010-06-15 22:34: messages would only be lost if RabbitMQ or the system crashed; see Jason's comment and my replies)

ocamlmq is written in OCaml. I originally estimated that a simple STOMP message broker with the feature set I wanted could be done in around 2000 lines. It took less than 1000, which later grew to ~1200 as I added new features (prefix topic subscriptions and control messages). Because ocamlmq is relatively small (the core of the broker takes around 500 lines of code), it is easy to extend and there's little room for bugs. The server is fairly efficient and is abstracted over a storage backend; currently only a PostgreSQL one is implemented (<150 lines of code).
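
As a rough idea of what "abstracted over a storage backend" means, a backend can be thought of as a module implementing a signature along these lines. This is a hypothetical sketch, not ocamlmq's actual interface:

      (* Hypothetical storage-backend signature, for illustration only. *)
      module type STORAGE =
      sig
        type t

        type message = {
          msg_id : string;
          destination : string;
          priority : int;
          body : string;
        }

        val connect : string -> t            (* e.g. a PostgreSQL conninfo *)
        val save : t -> message -> unit      (* durable once this returns *)
        val next : t -> queue:string -> message option
                                             (* oldest/highest-priority msg *)
        val ack : t -> msg_id:string -> unit (* delete once the consumer ACKs *)
      end

A PostgreSQL (or, later, sqlite3) backend then just provides these operations; the broker core never touches SQL directly.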

Limitations

ocamlmq works well in the intended use case (persistent queues and transient topic destinations, with possibly many queues and subscriptions), but it has some limitations which preclude its use in other domains:

  • ocamlmq is not designed to scale beyond several hundred / a few thousand simultaneous connections (it will work, but performance will be affected)

  • there is no flow control for topic messages (in the intended use case, topic messages are assumed to be relatively small and processed fast)

  • messages are limited to 16 MB on 32-bit platforms

  • the PostgreSQL storage backend can only persist a few thousand messages per second (note that ocamlmq allows >50K/s persistent message bursts in async mode)

  • ocamlmq does not support very high message rates (it delivers only ~20000-40000 messages/second on a 3GHz AMD64 box)

If you need complex routing rules, scalability to many thousands of simultaneous connections or other enterprise messaging features, you'll be better served by AMQP or JMS brokers. ActiveMQ, in particular, is very configurable, so it'll most likely fit the bill if memory consumption and scaling to many subscriptions are not a concern.

Scalability

ocamlmq has been designed to support millions of queues and topic subscriptions without undue memory usage. This table summarizes the time complexity of some STOMP operations:

      SEND to queue           O(log subscribers)
      SEND to topic           O(subscribers + log (total subs))
      SUBSCRIBE to queue      O(log (total subs))
      SUBSCRIBE to topic      O(log subscribers)
      ACK                     O(1)

ocamlmq typically needs around 150 bytes per subscription, so 1 million subscriptions will not take much more than 150 MB of memory (compare with the >120 MB ActiveMQ requires at startup). No extra memory is needed per queue, so you can use lots of them with no concerns about memory usage.
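
For a feel of the kind of structure that yields bounds like those above, here is a sketch using OCaml's balanced-tree maps and sets. It is hypothetical and for illustration only; ocamlmq's actual code differs:

      module SS = Set.Make (String)   (* subscriber (connection) ids *)
      module SM = Map.Make (String)   (* keyed by queue/topic name *)

      type subs = SS.t SM.t

      (* O(log total subs): one ordered-map plus one set insertion *)
      let subscribe topic conn subs =
        let cur = try SM.find topic subs with Not_found -> SS.empty in
        SM.add topic (SS.add conn cur) subs

      (* O(log total subs) to locate the set, O(subscribers) to deliver *)
      let dispatch topic deliver subs =
        try SS.iter deliver (SM.find topic subs) with Not_found -> ()

Because the map is ordered, all keys sharing a given prefix form a contiguous range, which is what keeps prefix matching cheap.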

Read more...

21 July 2009 at 07:49 Efficient low-level VMs implemented in high-level (functional) languages

The 2006 edition of the ICFP programming contest, one of the most enjoyable to date, introduced the Universal Machine, used by a fictional society to program around 200 BC. Many people have tried their hand at implementing the UM in a variety of languages. This table lists several C, C++ and Haskell implementations.

Even though it is clearly hard to beat C speed-wise here, high-level functional programming languages like Haskell or OCaml can come quite close despite the very low-level nature of the problem. I'm getting 75% of C's performance (1m21s vs. 1m for the "SANDmark" benchmark on a 3GHz AMD64) in OCaml with straightforward code that performs array bounds checking (i.e., unlike the other implementations, malicious machine images cannot take over the process). This makes it faster than the best-performing C++ implementation in that table, and comparable to other C implementations; obviously, this says more about the C++ implementations than about the language itself, but it shows that they're all in the same league. (BTW, Haskell has improved a lot since GHC 6.5: the "ugly, fast" Fast.hs is only 2.25 times slower than edwardk.c with 6.8.2, and a bit worse with 6.10.3: 2m18s = 2.3X; um6.hs, less harmful to the eye, is 5.5X slower than edwardk.c with 6.10.3, though --- virtually the same as with GHC 6.5. I had to change a few lines of code as some APIs have changed since.)

Here's the OCaml code:

Read more...

02 July 2009 at 13:22 Hash tables: separate chaining vs. double hashing

In my earlier finite map benchmarks, which compared several hash tables and functional maps (AVL trees and ternary trees), separate chaining (with $$\alpha = 0.47$$) beat double hashing ($$\alpha = 0.42$$) at unsuccessful searches (1% hit rate), but was slower at successful ones.

Theory predicts the following average number of probes for unsuccessful and successful lookups (expressions below):

[Graphs: average number of probes for unsuccessful searches and for successful searches, at the same load factor]

Separate chaining looks much better here, but the graphs are misleading, because we don't actually care as much about the load factor as about memory usage, so we want to compare the collision resolution schemes when the latter is the same.

If we use a linked list for the collision chain, this represents one word of overhead per item compared to double hashing. For instance, if the load factor with separate chaining is 1, we can afford to have a table twice as big with double hashing for the same amount of memory and a load factor of 50%. In other words, $$N + n = N'$$ where $$N$$ and $$N'$$ are the sizes of the tables for separate chaining and double hashing, and $$n$$ the number of elements. The respective load factors are

\[\begin{eqnarray*} \alpha & = & \frac{n}{N}\\ \alpha' & = & \frac{n}{N'} = \frac{n}{N+n}\\ & = & \frac{n/N}{1+n/N}\\ & = & \frac{\alpha}{1+\alpha}\end{eqnarray*} \]

The expressions for the average number of probes for unsuccessful and successful searches are (Knuth, TAOCP Vol 3, Section 6.4), for separate chaining

Read more...

02 July 2009 at 10:21 Math typesetting with jsMath

I've added math typesetting support to Ocsiblog, the software that runs this site, using jsMath (double-click on the equations to see their sources):

\[\begin{eqnarray*} \nabla.D & = & \rho\\ \nabla.B & = & 0\\ \nabla\times E & = & -\frac{\partial B}{\partial t}\\ \nabla\times H & = & j+\frac{\partial D}{\partial t}\end{eqnarray*} \]

When jsMath is enabled, you'll see in the bottom right-hand corner a button that opens a control panel where you can set several options, such as the scale or the rendering mechanism (TeX fonts, image fonts or native Unicode fonts; pick the one that works best for you and save the configuration --- TeX fonts are best, if you have them, but native Unicode fonts are quite decent). For best viewing, you can get the TeX fonts here (the page includes detailed install instructions for Windows, OSX and Un*x).

Here's what the above equations should look like with TeX and native Unicode fonts:

[Screenshots: rendering with TeX fonts; rendering with native Unicode fonts]