zombees - Accelerated SSL

May 3, 2011

There are any number of ways to improve the performance of SSL, ranging in difficulty from a simple config change to installing dedicated hardware or desperately trying to get the protocol modified.

This article focuses on accelerating the performance of SSL by improving OpenSSL’s hash, block and key algorithm performance. By increasing the data which you can transmit and the number of SSL handshakes that can be performed, you can reduce the overhead of SSL on your current server hardware.

Beyond configuration and implementation improvements, I will also present two additional solutions: adding an SSL accelerator card and modifying OpenSSL to utilize Intel’s IPP library.

OpenSSL version

OpenSSL 1.0.0d (or newer) provides a few important performance improvements over the older and more commonly deployed 0.9.8 series:

SSL_MODE_RELEASE_BUFFERS — if your application supports it (or you patch it to do so), memory utilization per SSL connection is drastically reduced
RSA performance — significantly increased, particularly on the 32-bit x86 arch

As a result, you should strongly consider upgrading if you haven’t already done so. RHEL 6 already provides OpenSSL 1.0.0; older versions have 0.9.8. However, you will need to take some further steps outlined below to correct performance degradations found on Nehalem and newer processors with RC4 and AES.

Architecture

Many of the changes presented below are dependent upon the system architecture you use. Only 32-bit x86 and 64-bit x86_64 have been compared; other architectures (SPARC, MIPS, etc.) will have very different performance characteristics.

If possible, you should ensure that your OS, application and OpenSSL are all 64-bit. RSA handshake performance is 2x better on x86_64, and any high-traffic site will need the additional memory that a 64-bit process can address.

Because of the specialized AES-NI instructions added in Intel Xeon Westmere (56xx series) and newer processors, it is highly desirable to utilize these processors in your hosts serving SSL traffic.
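To check whether an existing Linux host meets these recommendations, something like the following sketch may help. It assumes a Linux system that exposes /proc/cpuinfo, where AES-NI support appears as the aes entry in the flags line; the function names are mine and purely illustrative.

```python
import platform


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_aes_ni(cpuinfo_text):
    """True if the 'aes' flag (AES-NI) is listed for the first CPU."""
    return "aes" in cpu_flags(cpuinfo_text)


def is_64bit_process():
    """True if this Python process is 64-bit (a rough proxy for the userland)."""
    return platform.architecture()[0] == "64bit"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    print("AES-NI available:", has_aes_ni(text))
    print("64-bit process:", is_64bit_process())
```

Note that this only reports what the CPU advertises; whether your OpenSSL build actually uses AES-NI depends on how it was compiled and configured.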

Ciphersuite selection

Based on a previous article, two ciphersuites have been selected:

TLS_RSA_WITH_AES_128_CBC_SHA: RSA + AES-128-CBC + SHA1
TLS_RSA_WITH_RC4_128_SHA: RSA + RC4 + SHA1

You may also choose to permit TLS_RSA_WITH_AES_256_CBC_SHA and TLS_RSA_WITH_RC4_128_MD5. All of the other current TLS ciphersuites are inferior from a performance or security standpoint, or are not sufficiently widely supported to consider now.
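Most servers take these ciphersuites as an OpenSSL cipher list rather than by their TLS names. The sketch below maps the four suites mentioned above to their OpenSSL names and joins them into a cipher string. The preference order (AES-128 first, then RC4) is my assumption, not a recommendation from the benchmarks; reorder it if RC4 is faster on your hardware.

```python
# TLS ciphersuite name -> OpenSSL cipher-list name
OPENSSL_NAMES = {
    "TLS_RSA_WITH_AES_128_CBC_SHA": "AES128-SHA",
    "TLS_RSA_WITH_RC4_128_SHA": "RC4-SHA",
    "TLS_RSA_WITH_AES_256_CBC_SHA": "AES256-SHA",
    "TLS_RSA_WITH_RC4_128_MD5": "RC4-MD5",
}

# The two suites selected in this article (order is an assumption).
PREFERRED = ["TLS_RSA_WITH_AES_128_CBC_SHA", "TLS_RSA_WITH_RC4_128_SHA"]
# The two suites the article says you may additionally permit.
OPTIONAL = ["TLS_RSA_WITH_AES_256_CBC_SHA", "TLS_RSA_WITH_RC4_128_MD5"]


def cipher_list(include_optional=False):
    """Build a colon-separated OpenSSL cipher string."""
    suites = PREFERRED + (OPTIONAL if include_optional else [])
    return ":".join(OPENSSL_NAMES[s] for s in suites)


if __name__ == "__main__":
    print(cipher_list())
    print(cipher_list(include_optional=True))
```

The resulting string can be handed to anything that accepts an OpenSSL cipher list, such as a web server's cipher configuration directive or openssl ciphers -v for inspection.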

These ciphersuites contain three algorithms, each of which needs to be accelerated to improve overall performance:

Hash function (MAC)
Block cipher (bulk encryption)
Key exchange

Hash function

The two ciphersuites we chose both use the same hash function: SHA-1. MD5 and SHA-256 are available in other ciphersuites, but MD5 is offered only with RC4, and SHA-256 is not widely supported yet.

SHA-1

IPP provides an 18% performance improvement, approximately matching the SSL accelerator card with a single CPU core. A single core can hash 3.6 Gbps of data.

Block cipher

Two bulk encryption ciphers were selected: AES and RC4 (strictly speaking, RC4 is a stream cipher rather than a block cipher). This should provide sufficient coverage for all legacy and modern clients to successfully negotiate a ciphersuite with the server.

AES

AES is the preferred block cipher from a security perspective. It is also preferable from a performance standpoint if you are using a Westmere or newer CPU.

Moving from OpenSSL 0.9.8o to 1.0.0d results in a 48% decrease in performance on Nehalem and newer processors due to an undiagnosed issue with the x86_64 ASM in OpenSSL that means it is no longer properly optimized for this architecture. The easy fix is to simply prevent the ASM from being built and used when compiling OpenSSL.

IPP is a better solution though, as it enables OpenSSL to utilize the AES-NI instruction set on Westmere processors and yields a 280% increase in performance (above the fixed 1.0.0, for a total of 647%!). Performance is equivalent to the AES-NI OpenSSL engine, with the added benefit of smaller performance gains on older CPUs.

AES-256 has similar performance characteristics, although it is 25% slower.

Using AES-128 and IPP, a single core can encrypt 5.6 Gbps of data.

RC4

While not as secure as AES, RC4 is important for two reasons:

Older clients may not support AES, notably selected browsers on Windows XP
It offers higher performance than AES on pre-Westmere CPUs

You can easily obtain a 32% performance increase by applying this patch, if you are using a Nehalem or newer CPU in 64-bit mode.

If you don’t want to apply this patch, another option is to prevent OpenSSL from building and using the ASM version of RC4 in Configure.

It would take 5 CPU cores to equal the performance of the SSL accelerator card, which makes this encryption function the only clear winner for the accelerator card. It is also the only algorithm tested here for which I did not replace the native OpenSSL logic with IPP functions, even though IPP does include Arcfour functions to accelerate this cipher.

A single core can encrypt 3.2 Gbps of data.
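The per-algorithm figures above are measured in isolation, but every byte of SSL application data must be both encrypted and hashed. As a rough back-of-envelope sketch, if we assume the cipher and the MAC run one after the other on the same core with no other overhead (my simplification, not something measured here), the combined rate is the harmonic combination of the two. The 10 Gbps target is likewise just an illustrative number.

```python
import math

# Single-core figures quoted in this article, in Gbps.
SHA1_GBPS = 3.6
AES128_GBPS = 5.6   # AES-128 with IPP
RC4_GBPS = 3.2


def combined_gbps(cipher_gbps, mac_gbps):
    """Estimated rate when each byte is hashed and then encrypted serially.

    Assumption: time per byte is the sum of the two per-byte costs.
    """
    return 1.0 / (1.0 / cipher_gbps + 1.0 / mac_gbps)


def cores_needed(target_gbps, per_core_gbps):
    """Cores required for a target rate, assuming linear scaling."""
    return math.ceil(target_gbps / per_core_gbps)


if __name__ == "__main__":
    for name, cipher in (("AES-128 + SHA-1", AES128_GBPS), ("RC4 + SHA-1", RC4_GBPS)):
        rate = combined_gbps(cipher, SHA1_GBPS)
        print(f"{name}: ~{rate:.2f} Gbps per core, "
              f"{cores_needed(10, rate)} cores for a hypothetical 10 Gbps")
```

Treat the output as an upper bound for planning purposes only; real SSL traffic adds record framing, handshakes and application work on top.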

Key exchange

While there are a variety of available key exchange algorithms, only RSA is well supported at this time.

RSA

RSA keys come in various sizes, and the size selection is essential to determining the number of handshakes that can be performed per second. A 1024-bit RSA key is considered less secure, but provides the best performance currently available (sizes below 1024 bits should not be considered). Operations on a 2048-bit RSA key are approximately 5 times slower. Key sizes above 2048 bits are not necessary unless you have specialized needs; they will substantially impair performance.

The easiest way to improve RSA performance is to upgrade to OpenSSL 1.0.0, which increases performance by 25%.

The SSL accelerator card yields a 195% increase in performance over a single unaccelerated CPU core; IPP yields a 116% increase over an unaccelerated core. On two or more cores, the IPP version eclipses the SSL accelerator card.

Using RSA-1024 and IPP, a single core can process 4,238 handshakes per second.
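The percentages above are easier to compare as multipliers of a single unaccelerated core. The sketch below works through that arithmetic, assuming linear scaling across cores, and also applies the "approximately 5 times slower" figure to the IPP result to get a rough 2048-bit estimate. That last step is my extrapolation; 2048-bit keys with IPP were not benchmarked here.

```python
def multiplier(percent_increase):
    """Convert an 'N% increase' into a multiple of the baseline."""
    return 1.0 + percent_increase / 100.0


CARD = multiplier(195)          # accelerator card vs. one unaccelerated core
IPP_PER_CORE = multiplier(116)  # one IPP core vs. one unaccelerated core

IPP_RSA1024_HANDSHAKES = 4238   # per second, single core, from this article
RSA2048_SLOWDOWN = 5            # "approximately 5 times slower"


def ipp_cores_to_beat_card():
    """Smallest number of IPP cores that outperforms the card."""
    cores = 1
    while cores * IPP_PER_CORE <= CARD:
        cores += 1
    return cores


def estimated_rsa2048_handshakes():
    """Rough per-core estimate for 2048-bit keys (an extrapolation)."""
    return round(IPP_RSA1024_HANDSHAKES / RSA2048_SLOWDOWN)


if __name__ == "__main__":
    print(f"card: {CARD:.2f}x, IPP: {IPP_PER_CORE:.2f}x per core")
    print("IPP cores needed to beat the card:", ipp_cores_to_beat_card())
    print("estimated RSA-2048 handshakes/s per core:", estimated_rsa2048_handshakes())
```

One IPP core is 2.16x the baseline against 2.95x for the card, so the second core is what tips the balance.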

Testing methodology

All of the benchmarks in this article were obtained by running openssl speed -elapsed on a quiet host with the scaling_governor set to performance. Unless noted, the results are for a single core’s performance (specifically an Intel Xeon X5650, which is a Westmere CPU) using a 64-bit OpenSSL binary. Performance on multiple cores will scale linearly, provided that your application can utilize multiple cores efficiently.

Performance characteristics will be different for other CPUs (such as pre-Nehalem Xeons), as well as 32-bit OpenSSL.

SSL card results were obtained by running openssl speed -elapsed -multi 12. The card used is a Cavium Nitrox, model CN1620-400-NHB.

AES: openssl speed -elapsed -evp aes-128-cbc
RC4: openssl speed -elapsed -evp rc4
SHA-1: openssl speed -elapsed -evp sha1
RSA: openssl speed -elapsed rsa1024
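If you want to repeat these runs, a small wrapper along the following lines can build the commands listed above and convert the symmetric results to Gbps. It is only a sketch: it assumes the summary row that openssl speed prints for ciphers and digests (rates suffixed with k, in 1000s of bytes per second), and it does not parse the RSA output, which uses a different layout. Which block-size column to quote is up to you; taking the last (largest) one is my assumption.

```python
import re
import subprocess

# Algorithm arguments taken from the command list in this article.
BENCHMARKS = {
    "AES": ["-evp", "aes-128-cbc"],
    "RC4": ["-evp", "rc4"],
    "SHA-1": ["-evp", "sha1"],
    "RSA": ["rsa1024"],
}

_RATE = re.compile(r"^\d+(?:\.\d+)?k$")


def speed_command(name, multi=None):
    """Build the openssl speed command line; multi adds '-multi N'."""
    cmd = ["openssl", "speed", "-elapsed"]
    if multi:
        cmd += ["-multi", str(multi)]
    return cmd + BENCHMARKS[name]


def parse_rates(output):
    """Return the rates (1000s of bytes/s) from the last summary row found."""
    for line in reversed(output.splitlines()):
        tokens = line.split()
        rates = []
        # Collect trailing tokens such as '781925.38k', right to left.
        while tokens and _RATE.match(tokens[-1]):
            rates.insert(0, float(tokens.pop()[:-1]))
        if rates and tokens:  # tokens left over are the algorithm name
            return rates
    return []


def to_gbps(rate_k):
    """Convert 1000s of bytes per second to gigabits per second."""
    return rate_k * 1000 * 8 / 1e9


def run(name, multi=None):
    """Run one benchmark and return its parsed rates (empty for RSA)."""
    result = subprocess.run(speed_command(name, multi),
                            capture_output=True, text=True, check=True)
    return parse_rates(result.stdout)
```

For example, to_gbps(run("AES")[-1]) would give the AES-128-CBC rate for the largest block size on the current host, and run("AES", multi=12) mirrors the invocation used for the accelerator card.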

Acceleration options

The SSL accelerator card used for this benchmark retails for approximately $500. There are a few downsides to using a hardware device:

Must maintain a spare pool for quick replacement
Not general purpose
Constrained supply, may be difficult to procure in some regions
Requires some work to integrate with OpenSSL
May not have a free PCIe x4 slot in some servers

Intel’s IPP library is not free software, and requires a developer license at $200 per user. However, you may freely redistribute the product utilizing IPP (e.g. OpenSSL) without royalties, and there is no per-server cost. There are some downsides:

Not free or open source
Requires some work to integrate with OpenSSL
Utilizes the system CPU

Consider that upgrading the system CPU may be cheaper (and more beneficial for other purposes) than a dedicated accelerator card. A hex-core Intel Xeon X5650 is only $300 more than a quad-core Intel Xeon E5640, and the two additional cores will easily eclipse the performance of the SSL accelerator card in every algorithm except RC4.

IPP patch

The IPP cryptography samples include a patch against OpenSSL 0.9.8j, which I have modified to work with OpenSSL 1.0.0d. My changes simply make the patch apply cleanly; the code is otherwise identical to Intel’s original patch. That patch is available here.

When building this patched version of OpenSSL, you’ll need to keep a few things in mind:

It expects IPP to be installed in /opt/intel/composerxe-2011.3.174/ — if your IPP is somewhere else, simply modify the path listed in Configure to point to the right place.
You must use the build targets linux-elf-ipp (32-bit) and linux-x86_64-ipp (64-bit), instead of linux-elf and linux-x86_64.
You should still apply the RC4 patch as well, unless you’re using a version of OpenSSL that already includes it (none do as of this writing) — while IPP does have functions to accelerate RC4, they are not utilized in Intel’s patch.

Summary

Obtain IPP and the cryptography library
Upgrade to OpenSSL 1.0.0d, patched to utilize IPP
Modify your application (or upgrade to a patched version) so that it sets SSL_MODE_RELEASE_BUFFERS
Only offer AES-128 and RC4 ciphersuites
Deploy on Intel Xeon Westmere and newer processors