The language C++ has features which make it attractive for scientific computing: templates for generic programming, operator overloading for expressiveness, and object-orientation for abstraction and code reuse. Despite these features, the scientific computing community has been reluctant to adopt C++, partly due to performance problems. In the early 1990s, C++ programs were much slower than their Fortran counterparts -- typical benchmarks showed C++ lagging behind Fortran's performance by anywhere from 20% to a factor of ten. Users of C++ often rewrote the number crunching parts of their programs in Fortran to get acceptable performance.

In the past five years, the performance of C++ programs has
improved markedly due to a combination of better optimizing C++
compilers and new library techniques. In some recent
benchmarks, C++ is even *faster* than Fortran.
In this article, I'll explain these techniques
and show benchmarks comparing a C++ class library (Blitz++)
to Fortran.

Operations on vectors, matrices, and arrays form the
foundation of most scientific computing codes. In
C++, classes can customize the meaning of operators such
as (+, -, *, /). This capability, called *operator
overloading*, enables C++ matrix/vector class libraries
to provide a very natural notation. For example, this code might
represent the summation of three vectors:
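With a vector class that overloads `operator+`, the sum is simply `w = x + y + z;`. Here is a minimal sketch of such a class (the name `Vec` and its deliberately naive pairwise `operator+` are my assumptions, not the Blitz++ implementation):

```cpp
#include <cstddef>

// A deliberately naive vector class: operator+ computes its result
// into a brand-new Vec, so every + allocates a temporary.
class Vec {
public:
    explicit Vec(std::size_t size) : n_(size), data_(new double[size]) { }
    Vec(const Vec& other) : n_(other.n_), data_(new double[other.n_]) {
        for (std::size_t i = 0; i < n_; ++i) data_[i] = other.data_[i];
    }
    ~Vec() { delete[] data_; }
    Vec& operator=(const Vec& other) {        // assumes equal lengths
        for (std::size_t i = 0; i < n_; ++i) data_[i] = other.data_[i];
        return *this;
    }
    double  operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i)       { return data_[i]; }
    std::size_t size() const { return n_; }
private:
    std::size_t n_;
    double* data_;
};

// Pairwise evaluation: each + allocates and fills a temporary Vec.
Vec operator+(const Vec& a, const Vec& b) {
    Vec result(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        result[i] = a[i] + b[i];
    return result;
}
```

With this in place, `w = x + y + z;` reads naturally -- but the convenience has a cost.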

Operator overloading in C++ is pairwise: to evaluate
this expression, C++ forces you to first evaluate `t1 = x + y`,
then `t2 = t1 + z`, and finally assign `w = t2`. The `t1` and
`t2` objects are *temporaries* used to store intermediate results.

Pairwise expression evaluation for vectors is slow: instead of
a single loop to evaluate an expression,
multiple loops and temporary vectors are generated.
These temporary vectors are usually allocated dynamically,
which causes a severe performance loss for small
vectors. For example, suppose we evaluated `w = x + y + z`
for vectors of length 3. The floating point arithmetic takes
only a handful of clock cycles; dynamically allocating
and deleting the temporary vectors consumes *hundreds* (or even
thousands) of cycles.
Consequently, C++ matrix/vector libraries tended to be
horribly slow for operations on small objects.

For large vectors, the main performance issue becomes
memory bandwidth. Many scientific computing codes are
constrained in performance not by floating point arithmetic,
but by the time required to ship data between main memory
and the CPU. Suppose that each vector contained
1 MB of data. The operator-overloaded version of
`w = x + y + z` requires 8 MB of data to be transferred between
the CPU and main memory (3 MB for `t1 = x + y`, 3 MB for
`t2 = t1 + z`, and 2 MB for `w = t2`), exactly twice the 4 MB
a single loop actually needs to read `x`, `y`, `z` and write `w`.
As a consequence, the speed is roughly half
what could be attained.

The pairwise evaluation problem has been solved by the
*expression templates* technique. In an expression
templates implementation, operators like + don't return
intermediate results. From the compiler's point of view,
the expression is still evaluated pairwise, but operators
like + return partial parse trees of the expression
(rather than intermediate results).
The parse tree is encoded as a C++ class type using templates.
When the parse tree is assigned to an object, the expression
can be evaluated in one pass without having to generate
intermediate results.
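The machinery can be sketched in a few dozen lines (a simplified illustration of the technique, not the Blitz++ implementation; the names `Sum` and `Vec` are my own):

```cpp
#include <cstddef>

// Simplified expression-templates sketch: operator+ returns a
// lightweight parse-tree node; no intermediate vectors are built.
template<class L, class R>
struct Sum {
    const L& l;
    const R& r;
    Sum(const L& lhs, const R& rhs) : l(lhs), r(rhs) { }
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::size_t n;
    double* data;
    explicit Vec(std::size_t size) : n(size), data(new double[size]) { }
    ~Vec() { delete[] data; }
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    // Assigning a parse tree evaluates the whole expression in
    // a single pass, with no temporary vectors.
    template<class E>
    Vec& operator=(const E& expr) {
        for (std::size_t i = 0; i < n; ++i)
            data[i] = expr[i];
        return *this;
    }
private:
    Vec(const Vec&);   // copying omitted in this sketch
};

// Builds the parse tree: x + y + z has type Sum<Sum<Vec,Vec>, Vec>.
template<class L, class R>
Sum<L, R> operator+(const L& lhs, const R& rhs) {
    return Sum<L, R>(lhs, rhs);
}
```

The parse-tree node holds only references, so "evaluating" `x + y + z` costs nothing until the assignment runs its single loop.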

The technique has some overhead associated with the
expression parsing, but the performance for long vectors
is just as good as hand-coded C. A good optimizing C++
compiler is able to eliminate this overhead completely,
so that performance is reasonable for small vectors too.
(To get *optimal* performance, you have to unroll the
loops completely for small vectors, as I'll describe
next.)

In a previous article (DDJ August '96), I described
special optimizations for small vectors and matrices
using the *template metaprogram* technique.
Briefly summarized, it turns out that C++ compilers
can be persuaded to do arbitrary computations at
compile time by *metaprograms* which exploit the
template instantiation mechanism. One good use
of template metaprograms is creating *specialized algorithms*.
As an example, let's consider this little bit of code,
which calculates a 3x3 matrix-vector product:
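In outline (a sketch; `Matrix`, `Vector`, and `product()` follow Blitz++-style naming, but the exact API is my assumption):

```
Matrix A(3,3);
Vector b(3), c(3);
// ... fill A and b ...
c = product(A, b);
```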

Calculating the result requires 15 floating-point operations,
but this bit of code will consume hundreds of clock cycles.
Why? The Matrix and Vector objects must allocate their
memory dynamically, which takes a lot of time. The `product()`
function resides in a library somewhere, and likely consists
of two nested loops to compute the matrix-vector product.
When compilers optimize loop code, they often assume that the
loop will be executed many times. The result is code which is
very fast for large matrices, but mediocre for small matrices.

The solution I've adopted in the class library
Blitz++ is to provide two versions of objects like
Matrix and Vector. One version is optimized for
large objects (e.g. `Vector`), and the
other is optimized for small objects (e.g. `TinyVector`).
The `Tiny...` objects use template metaprograms to
create very fast code. Here's the matrix-vector
product implemented with `Tiny...` objects:
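In outline (a sketch; `TinyMatrix` and the `product()` signature follow the Blitz++ naming but are my assumptions here):

```
TinyMatrix<double,3,3> A;
TinyVector<double,3> b, c;
// ... fill A and b ...
c = product(A, b);
```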

Here's a skeletal declaration of `TinyVector`:
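Roughly like this (a simplified sketch of the idea; the real class has more members):

```cpp
// The data array is a member, not a pointer to the heap, so a
// TinyVector can live entirely on the stack (or in registers).
template<class T, int N>
class TinyVector {
public:
    T&       operator[](int i)       { return data_[i]; }
    const T& operator[](int i) const { return data_[i]; }
    int length() const { return N; }
private:
    T data_[N];
};
```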

Putting the vector data inside the object allows it
to reside on the stack, so the overhead of allocating
memory is avoided.
The function `product()` invokes a template metaprogram
which creates a specialized matrix-vector multiplication algorithm.
The metaprogram produces this code:
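That is, the fully unrolled form (reconstructed here with plain arrays as a sketch):

```cpp
// Fully unrolled 3x3 matrix-vector product c = A*b: no loops,
// no temporaries -- just straight-line floating point code.
inline void matvec3(const double A[3][3], const double b[3], double c[3])
{
    c[0] = A[0][0]*b[0] + A[0][1]*b[1] + A[0][2]*b[2];
    c[1] = A[1][0]*b[0] + A[1][1]*b[1] + A[1][2]*b[2];
    c[2] = A[2][0]*b[0] + A[2][1]*b[1] + A[2][2]*b[2];
}
```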

Big blocks of uninterrupted floating point code like this
are exactly what's needed to get peak performance on modern
pipelined, superscalar CPUs. The metaprogram version of the
3x3 matrix-vector product takes 13 clock cycles in
sustained execution on an RS/6000 Model 43P. (It's
this fast because the matrix `A` is kept in registers
and combined multiply-add instructions are used.)

A large part of C++'s performance problem was inadequate
compilers. Happily, C++ compilers have come a long way in the past
decade. To demonstrate this, let's look at KAI C++ from Kuck and
Associates Inc., which has incredible optimization
abilities. As an example, we'll examine a little piece
of code which reflects a ray of light off a surface.
We'll call the incident ray vector **x**, the reflected ray **y**,
and the surface normal **n** (see Figure 1).

Figure 1: Ray reflection
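The routine might be written like this (a sketch assuming Blitz++-style `TinyVector` and a metaprogram-unrolled `dot()`; the exact signature is my assumption):

```
template<class T>
inline void reflect(TinyVector<T,3>& y,
                    const TinyVector<T,3>& x,
                    const TinyVector<T,3>& n)
{
    // y = x - 2 (x . n) n
    y = x - 2 * dot(x, n) * n;
}
```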

The `reflect()` code looks very similar to the mathematical
equation which it implements: **y** = **x** - 2 (**x** . **n**) **n**.
Behind this simplicity lies an elaborate implementation:
first, a template metaprogram unrolls the dot product `dot(x,n)`
into the code `x[0]*n[0]+x[1]*n[1]+x[2]*n[2]`. The vector arithmetic
is then parsed using expression templates, which produces temporary
objects corresponding to partial parse trees. Evaluation of the
expression for `y[0]`, `y[1]`, and `y[2]` is unrolled using another
template metaprogram. (KAI C++ is smart enough to unroll the loops
itself without metaprograms, but other C++ compilers aren't.)
KAI C++ is able
to optimize away all the expression
templates overhead using *copy propagation* and *dead code
elimination*, and produce code which is equivalent to this:
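Roughly this (a reconstruction with plain arrays):

```cpp
// What the optimizer boils reflect() down to: one dot product,
// doubled once, then three subtract-multiply lines.
inline void reflect3(double y[3], const double x[3], const double n[3])
{
    double d = x[0]*n[0] + x[1]*n[1] + x[2]*n[2];
    d += d;                  // d = 2 (x . n)
    y[0] = x[0] - d*n[0];
    y[1] = x[1] - d*n[1];
    y[2] = x[2] - d*n[2];
}
```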

Examining this code, we see 6 multiplies, 3 additions, 3 subtractions,
6 loads and 3 stores, for a total of 21 operations. On an RS/6000, the
assembly code generated by KAI C++ and the back end compiler consists
of only 20 opcodes. The data are kept in registers, and operated on by
combined multiply-add instructions. This is very impressive considering
how complex the expression templates machinery is.

But it gets better! Suppose we want to calculate the reflection of a
particular ray. Overloading the comma operator gives us a concise way
to assign vector literals:
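For instance (a sketch; the literal values are example data of my choosing):

```
TinyVector<double,3> x, y, n;
x = 1.0, 1.0, 0.0;   // overloaded comma assigns the three literals
n = 0.0, 1.0, 0.0;   // unit surface normal
reflect(y, x, n);
```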

When this code is compiled, KAI C++ discards the temporary objects
created by the overloaded comma operator, does the optimizations
mentioned before for `reflect()`, and performs *constant folding*
to evaluate arithmetic on constant values. The final code is
equivalent to:
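For example, if **x** = (1, 1, 0) and **n** = (0, 1, 0), then 2 (**x** . **n**) = 2 and the whole computation folds down to three constant stores (example values of my choosing):

```
y[0] = 1.0;
y[1] = -1.0;
y[2] = 0.0;
```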

The compiler calculates the reflection at compile time, and ditches
the `x` and `n` vectors, since they're not needed anymore. Now that's
an optimizing compiler!

So C++ compilers have come a long way in the past decade. When compilers like KAI C++ are combined with library techniques such as expression templates and template metaprograms, the resulting performance is very impressive, as the benchmarks will illustrate.

All benchmarks were run on a 100 MHz IBM RS/6000 Model 43P (PowerPC) with 16 KB of L1 cache and 256 KB of L2 cache. The C++ programs were compiled using KAI C++ at +K3 -O2, and Fortran 77 code was compiled with XL Fortran at -O3.

The first benchmark measures the performance of a DAXPY operation, short for "Double precision A times X Plus Y". This routine is the workhorse of many linear algebra operations, such as Gaussian elimination. Here's a DAXPY operation using the Blitz++ class library:
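In Blitz++ notation it might read (a sketch; initialization and the value of `a` are placeholders):

```
Vector<double> x(N), y(N);
double a = 0.5;
// ... initialize x and y ...
y += a * x;    // the whole daxpy in one expression, one loop
```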

Figure 2 shows benchmark results for four versions:
the Blitz++ classes `TinyVector` and `Vector`, Fortran 77,
and the class `valarray` (Modena implementation) from the draft
ISO/ANSI C++ standard.
The class `TinyVector` is optimized for very small vectors,
so its performance is much better than the other implementations
for vectors of length 1..10. The `Vector` class and the
Fortran 77 implementation have some loop overhead which makes
their performance poorer for small vectors. As you can see,
for longer vectors their performance is very similar. The
sudden performance drop around N=1000 occurs because the vectors
are no longer small enough to fit in the L1 cache.

Figure 2: DAXPY Benchmark

One problem with blocking is selecting the block size: it has to be chosen carefully with the particular architecture and cache sizes in mind to get optimal performance. In the Blitz++ library, I decided to adopt a slightly different approach, based on the Hilbert space filling curve (Figure 3). In this approach, there are two loops: an outer loop which follows an approximation to the Hilbert space filling curve (1), and an inner loop (2) which traverses the corresponding column of the array. The space filling curve ensures good cache use, and the inner loop allows the CPU pipelines to pick up steam as the column is traversed.

Figure 3: Hilbert curve array traversal
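In Blitz++ the stencil itself is written as an ordinary array expression; a sketch (the array size, `Range` bounds, and 7-point averaging stencil are my assumptions):

```
const int N = 64;                       // example size
Array<double,3> A(N,N,N), B(N,N,N);
Range I(1,N-2), J(1,N-2), K(1,N-2);

// 7-point averaging stencil over the array interior
A(I,J,K) = (B(I,J,K)
          + B(I-1,J,K) + B(I+1,J,K)
          + B(I,J-1,K) + B(I,J+1,K)
          + B(I,J,K-1) + B(I,J,K+1)) / 7.0;
```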

Using the expression templates technique, this code is transformed into a Hilbert curve traversal. To implement a similar traversal in Fortran, you'd have to write a hundred lines of code, or rewrite the compiler.

Figure 4 shows a benchmark comparing the Blitz++ version of the 3D array stencil with the Fortran 77 version for a variety of array sizes. For very small arrays, Fortran 77 has the upper hand, because its code is much simpler. For larger arrays, the extra complexity of the Hilbert curve traversal pays off: the higher L1 cache hit rate translates into better performance.

Figure 4: 3D Array Stencil Benchmark
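The next benchmark is drawn from lattice quantum chromodynamics (QCD). Its kernel, `qcdCalc()`, might look like this in Blitz++ style (a sketch; the type names are my assumptions):

```
// Sketch (type names assumed):
//   Spinor       ~ a 3x2 complex matrix, e.g. TinyMatrix<complex<double>,3,2>
//   GaugeElement ~ a 3x3 complex matrix
void qcdCalc(Vector<Spinor>& res,
             Vector<GaugeElement>& M,
             Vector<Spinor>& src)
{
    for (int i = 0; i < res.length(); ++i)
        res[i] = product(M[i], src[i]);
}
```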

The arguments `res` and `src` are vectors of 3x2 complex
matrices; `M` is a vector of 3x3 complex matrices. Here's
the initial Fortran 77 version:

Figure 5 shows the benchmark results for these two codes. The performance of the initial Fortran 77 version was disappointing due to poor memory layout of the data, and the failure of the compiler to unroll the inner loops. The performance doubled after optimizing the memory layout and manually unrolling the inner loops (see the series "Fortran 77 tuned" in Figure 5), but still wasn't as good as Blitz++. Examination of the assembly code revealed that the Fortran compiler was suffering from "register spillage" -- it was unable to fit all the data in registers, so their contents had to be spilled out onto the stack. This is a compiler problem; with a better code generator, the Fortran version ought to be as fast as the initial Blitz++ version.

Figure 5: Lattice QCD Benchmark

The revised `qcdCalc()` routine operated on a single vector
of `latticeUnit` objects, rather than three separate
vectors. This encapsulation ensured that related
vectors. This encapsulation ensured that related
objects were stored nearby in memory, a property
known as *locality of reference*. Cache designs assume
programs will exhibit locality of reference, so
encapsulation tends to improve performance. In
the QCD benchmark, encapsulation boosted performance
by 15-20% (see the "Blitz++ tuned" series in Figure
5).
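The encapsulated layout can be sketched like this (a guess at the structure; the field names and plain `std::complex` arrays are my assumptions):

```cpp
#include <complex>

// Grouping each lattice site's data in one struct keeps related
// values adjacent in memory (locality of reference).
struct latticeUnit {
    std::complex<double> res[3][2];
    std::complex<double> M[3][3];
    std::complex<double> src[3][2];
};
```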

After nearly a decade of being dismissed as too slow for scientific computing, C++ has caught up with Fortran and is giving it stiff competition. The combination of new library design techniques (expression templates, template metaprograms) and better compilers means that programmers can enjoy the expressiveness and ease of C++ without paying a performance price.

For references and more information about the techniques described in this article, see the Blitz++ web page at http://oonumerics.org/blitz/