
Scientific Computing: C++ versus Fortran

Todd Veldhuizen (tveldhui@acm.org)

July 1997

1: Introduction

2: Why C++ programs are much faster than they used to be

3: Benchmark results

4: Conclusion

5: For more information

1: Introduction

The language C++ has features which make it attractive for scientific computing: templates for generic programming, operator overloading for expressiveness, and object-orientation for abstraction and code reuse. Despite these features, the scientific computing community has been reluctant to adopt C++, partly due to performance problems. In the early 1990s, C++ programs were much slower than their Fortran counterparts -- typical benchmarks showed C++ lagging behind Fortran's performance by anywhere from 20% to a factor of ten. Users of C++ often rewrote the number crunching parts of their programs in Fortran to get acceptable performance.

In the past five years, the performance of C++ programs has improved markedly due to a combination of better optimizing C++ compilers and new library techniques. In some recent benchmarks, C++ is even faster than Fortran. In this article, I'll explain these techniques and show benchmarks comparing a C++ class library (Blitz++) to Fortran.

2: Why C++ programs are much faster than they used to be

Expression templates

Operations on vectors, matrices, and arrays form the foundation of most scientific computing codes. In C++, classes can customize the meaning of operators such as +, -, *, and /. This capability, called operator overloading, enables C++ matrix/vector class libraries to provide a very natural notation. For example, this code might represent the summation of three vectors:

w = x + y + z;

Operator overloading in C++ works pairwise: to evaluate this expression, the compiler first evaluates t1=x+y, then t2=t1+z, and finally assigns w=t2. The objects t1 and t2 are temporaries used to store intermediate results.

Pairwise expression evaluation for vectors is slow: instead of a single loop to evaluate an expression, multiple loops and temporary vectors are generated. These temporary vectors are usually allocated dynamically, which causes a severe performance loss for small vectors. For example, suppose we evaluated w=x+y+z for vectors of length 3. The floating point arithmetic takes only a handful of clock cycles; dynamically allocating and deleting the temporary vectors consumes hundreds (or even thousands) of cycles. Consequently, C++ matrix/vector libraries tended to be horribly slow for operations on small objects.

For large vectors, the main performance issue becomes memory bandwidth. Many scientific computing codes are constrained in performance not by floating point arithmetic, but by the time required to ship data between main memory and the CPU. Suppose that each vector contained 1 MB of data. The operator-overloaded version of w=x+y+z requires 8 MB of data to be transferred between the CPU and main memory: reading x and y and writing t1, reading t1 and z and writing t2, and finally reading t2 and writing w. A single loop would need only 4 MB (reading x, y, and z, and writing w) -- exactly half. As a consequence, the speed is roughly half what could be attained.

The pairwise evaluation problem has been solved by the expression templates technique. In an expression templates implementation, operators like + don't return intermediate results. From the compiler's point of view, the expression is still evaluated pairwise, but operators like + return partial parse trees of the expression (rather than intermediate results). The parse tree is encoded as a C++ class type using templates. When the parse tree is assigned to an object, the expression can be evaluated in one pass without having to generate intermediate results.

The technique has some overhead associated with the expression parsing, but the performance for long vectors is just as good as hand-coded C. A good optimizing C++ compiler is able to eliminate this overhead completely, so that performance is reasonable for small vectors too. (To get optimal performance, you have to unroll the loops completely for small vectors, as I'll describe next.)

Optimizations for small objects

In a previous article (DDJ August '96), I described special optimizations for small vectors and matrices using the template metaprogram technique. Briefly summarized, it turns out that C++ compilers can be persuaded to do arbitrary computations at compile time by metaprograms which exploit the template instantiation mechanism. One good use of template metaprograms is creating specialized algorithms. As an example, let's consider this little bit of code, which calculates a 3x3 matrix-vector product:

Matrix<float> A(3,3);
Vector<float> b(3), c(3);
// ...
c = product(A,b);

Calculating the result requires 15 floating-point operations, but this bit of code will consume hundreds of clock cycles. Why? The Matrix and Vector objects must allocate their memory dynamically, which takes a lot of time. The product() function resides in a library somewhere, and likely consists of two nested loops to compute the matrix-vector product. When compilers optimize loop code, they often assume that the loop will be executed many times. The result is code which is very fast for large matrices, but mediocre for small matrices.

The solution I've adopted in the class library Blitz++ is to provide two versions of objects like Matrix and Vector. One version is optimized for large objects (e.g. Vector), and the other is optimized for small objects (e.g. TinyVector). The Tiny... objects use template metaprograms to create very fast code. Here's the matrix-vector product implemented with Tiny... objects:

TinyMatrix<float,3,3> A;
TinyVector<float,3> b, c;

c = product(A,b);

Here's a skeletal declaration of TinyVector:

template<class T, int N>
class TinyVector {
private:
    T data[N];
};

Putting the vector data inside the object allows it to reside on the stack, so the overhead of allocating memory is avoided. The function product() invokes a template metaprogram which creates a specialized matrix-vector multiplication algorithm. The metaprogram produces this code:

c[0] = A[0][0]*b[0] + A[0][1]*b[1] + A[0][2]*b[2];
c[1] = A[1][0]*b[0] + A[1][1]*b[1] + A[1][2]*b[2];
c[2] = A[2][0]*b[0] + A[2][1]*b[1] + A[2][2]*b[2];

Big blocks of uninterrupted floating point code like this are exactly what's needed to get peak performance on modern pipelined, superscalar CPUs. The metaprogram version of the 3x3 matrix-vector product takes 13 clock cycles in sustained execution on an RS/6000 Model 43P. (It's this fast because the matrix A is kept in registers and combined multiply-add instructions are used.)

Better compilers

A large part of C++'s performance problem was inadequate compilers. Happily, C++ compilers have come a long way in the past decade. To demonstrate this, let's look at KAI C++ from Kuck and Associates Inc., which has incredible optimization abilities. As an example, we'll examine a little piece of code which reflects a ray of light off a surface. We'll call the incident ray vector x, the reflected ray y, and the surface normal vector n (Figure 1).

Figure 1: Ray reflection

We'll work in a three dimensional space, so it makes sense to use the class TinyVector, which does special optimizations for small vectors. Here's the reflection function:

typedef TinyVector<double,3> ray;

inline void reflect(ray& y, const ray& x, const ray& n)
{
    y = x - 2 * dot(x,n) * n;
}

The reflect() code looks very similar to the mathematical equation which it implements: y = x - 2 (x . n) n. Behind this simplicity lies an elaborate implementation: first, a template metaprogram unrolls the dot product dot(x,n) into the code x[0]*n[0]+x[1]*n[1]+x[2]*n[2]. The vector arithmetic is then parsed using expression templates, which produces temporary objects corresponding to partial parse trees. Evaluation of the expression for y[0], y[1] and y[2] is unrolled using another template metaprogram. (KAI C++ is smart enough to unroll the loops itself without metaprograms, but other C++ compilers aren't). KAI C++ is able to optimize away all the expression templates overhead using copy propagation and dead code elimination, and produce code which is equivalent to this:

double _t1 = x[0]*n[0] + x[1]*n[1] + x[2]*n[2];
double _t2 = _t1 + _t1;
y[0] = x[0] - _t2 * n[0];
y[1] = x[1] - _t2 * n[1];
y[2] = x[2] - _t2 * n[2];

Examining this code, we see 6 multiplies, 3 additions, 3 subtractions, 6 loads and 3 stores for a total of 21 operations. On an RS/6000, the assembly code generated by KAI C++ and the back end compiler consists of only 20 opcodes. The data are kept in registers, and operated on by combined multiply-add instructions. This is very impressive considering how complex the expression templates machinery is. But it gets better! Suppose we want to calculate the reflection of a particular ray. Overloading the comma operator gives us a concise way to assign vector literals:

void test(ray& y)
{
    ray x, n;
    x = 1.00, 0.40, -1.00;
    n = 0.31, 0.20, 0.93;
    reflect(y, x, n);
}

When this code is compiled, KAI C++ discards the temporary objects created by the overloaded comma operator, does the optimizations mentioned before for reflect(), and performs constant folding to evaluate arithmetic on constant values. The final code is equivalent to:

void test(ray& y)
{
    y[0] = 1.3348;
    y[1] = 0.6160;
    y[2] = 0.0044;
}

The compiler calculates the reflection at compile time, and ditches the x and n vectors, since they're not needed anymore. Now that's an optimizing compiler!

So C++ compilers have come a long way in the past decade. When compilers like KAI C++ are combined with library techniques such as expression templates and template metaprograms, the resulting performance is very impressive, as the benchmarks will illustrate.

3: Benchmark results

All benchmarks were run on a 100 MHz IBM RS/6000 Model 43P (PowerPC) with 16 KB of L1 cache and 256 KB of L2 cache. The C++ programs were compiled using KAI C++ at +K3 -O2, and the Fortran 77 code was compiled with XL Fortran at -O3.

Linear algebra

The first benchmark measures the performance of a DAXPY operation, short for "Double precision A times X Plus Y". This routine is the workhorse of many linear algebra operations, such as Gaussian elimination. Here's a DAXPY operation using the Blitz++ class library:

Vector<double> x(N), y(N);
double a;
// ...
y += a * x;

Figure 2 shows benchmark results for four versions: the Blitz++ classes TinyVector and Vector, Fortran 77, and the class valarray (Modena implementation) from the draft ISO/ANSI C++ standard. The class TinyVector is optimized for very small vectors, so its performance is much better than the other implementations for vectors of length 1..10. The Vector class and the Fortran 77 implementation have some loop overhead which makes their performance poorer for small vectors. As you can see, for longer vectors their performance is very similar. The sudden performance drop around N=1000 occurs because the vectors are no longer small enough to fit in the L1 cache.

Figure 2: DAXPY Benchmark

The performance of valarray is typical of older C++ class libraries which use pairwise evaluation and dynamically allocated temporaries. Notice that for very small vectors, memory allocation overhead causes the performance to be awful -- about a tenth that of Vector and Fortran 77. Its peak performance is also terrible, since pairwise evaluation forces it to move twice as much memory around as necessary. The extra memory use causes valarray to experience the L1 cache crunch much sooner than Vector or Fortran 77.

Fortran 90, the successor to Fortran 77, was also benchmarked, but the Fortran 90 compiler on this platform is not very good: its performance was much worse than that of Fortran 77.

Array stenciling

Array stencils are common in signal processing and the solution of partial differential equations. A stencil updates an array element using a function of the elements in a local neighbourhood. Here's a Fortran 77 stencil operation which smoothes a 3D array:

DO k=2, N-1
  DO j=2, N-1
    DO i=2, N-1
      A(i,j,k) = (B(i,j,k) + B(i+1,j,k) + B(i-1,j,k)
     .          + B(i,j+1,k) + B(i,j-1,k) + B(i,j,k+1)
     .          + B(i,j,k-1)) / 7.0
    END DO
  END DO
END DO

The performance of stencil operations is heavily constrained by memory accesses, so efficient cache use is critical. Most computers have (at least) two levels of cache: the L1 cache, which is part of the CPU, and the external L2 cache. Of these, the L1 cache is faster, so for good performance the L1 cache must be used as much as possible. In the straightforward Fortran implementation above, the L1 cache hit rate is typically about 54% for a large array (this means that 54% of the data is found in the L1 cache; the remaining data has to be pulled in from the L2 cache or main memory). To get a higher L1 cache hit rate, it's necessary to traverse the array in a different order. One approach is dividing the array into smaller subarrays, and applying the stencil to each subarray in turn. This is commonly called tiling or blocking. Converting the above stencil to a blocked version would require five (or six) nested loops. In some cases, Fortran compilers can transform an algorithm into a blocked version automatically, but often not for complicated stencils.

One problem with blocking is selecting the block size: it has to be chosen carefully with the particular architecture and cache sizes in mind to get optimal performance. In the Blitz++ library, I decided to adopt a slightly different approach, based on the Hilbert space filling curve (Figure 3). In this approach, there are two loops: an outer loop which follows an approximation to the Hilbert space filling curve (1), and an inner loop (2) which traverses the corresponding column of the array. The space filling curve ensures good cache use, and the inner loop allows the CPU pipelines to pick up steam as the column is traversed.

Figure 3: Hilbert curve array traversal

I've found this traversal ordering has a very nice property: performance increases steadily for every extra bit of cache you have, resulting in good performance on all platforms. For most platforms, the L1 cache hit rate is 70% to 76% (compared to only 54% for the Fortran 77 program described above). This isn't the case for blocked algorithms, where performance makes abrupt jumps at critical cache sizes, making tuning very platform-dependent.

Implementing the Hilbert curve traversal is complicated: it requires about a hundred lines of code. In Blitz++, this complexity is completely hidden inside the library. The user level code looks like this:

const int N = 64;
Array<double,3> A(N,N,N), B(N,N,N);
// ...

Range I(1,N-2), J(1,N-2), K(1,N-2);

A(I,J,K) = (B(I,J,K) + B(I+1,J,K) + B(I-1,J,K)
          + B(I,J-1,K) + B(I,J+1,K) + B(I,J,K-1)
          + B(I,J,K+1)) / 7.0;

Using the expression templates technique, this code is transformed into a Hilbert curve traversal. To implement a similar traversal in Fortran, you'd have to write a hundred lines of code, or rewrite the compiler.

Figure 4 shows a benchmark comparing the Blitz++ version of the 3D array stencil with the Fortran 77 version for a variety of array sizes. For very small arrays, Fortran 77 has the upper hand, because its code is much simpler. For larger arrays, the extra complexity of the Hilbert curve traversal pays off: the higher L1 cache hit rate translates into better performance.

Figure 4: 3D Array Stencil Benchmark

Lattice Quantum Chromodynamics

The last benchmark I'd like to describe is drawn from Lattice Quantum Chromodynamics (QCD). Lattice QCD is a computational version of the Standard Model of particle physics. It's used to calculate particle properties (such as mass) predicted by current theory. Lattice QCD is very demanding: these programs engage supercomputers for months or years at a time.

This benchmark is based on a Lattice QCD implementation from the Edinburgh Parallel Computing Centre in the UK. In the EPCC implementation, the bulk of CPU time is spent multiplying 3x3 and 3x2 matrices of complex numbers. This is an ideal application for the Blitz++ TinyMatrix class, which uses template metaprograms to create specialized algorithms for matrix-matrix multiplication. Here's an initial version of the Blitz++ benchmark:

typedef TinyMatrix<complex<double>,3,3> gaugeElement;
typedef TinyMatrix<complex<double>,3,2> spinor;

void qcdCalc(Vector<spinor>& res, Vector<gaugeElement>& M,
    Vector<spinor>& src)
{
    for (int i=0; i < res.length(); ++i)
        res[i] = product(M[i], src[i]);
}

The arguments res and src are vectors of 3x2 complex matrices; M is a vector of 3x3 complex matrices. Here's the initial Fortran 77 version:

      subroutine qcdf(M, res, src, V)
      integer V, site, spin, col
      double complex M(V,3,3), res(V,3,2), src(V,3,2)

      DO site=1,V
        DO spin=1,2
          DO col=1,3
            res(site,col,spin) = M(site,col,1) * src(site,1,spin)
     .        + M(site,col,2) * src(site,2,spin)
     .        + M(site,col,3) * src(site,3,spin)
          ENDDO
        ENDDO
      ENDDO

      return
      end

Figure 5 shows the benchmark results for these two codes. The performance of the initial Fortran 77 version was disappointing due to poor memory layout of the data, and the failure of the compiler to unroll the inner loops. The performance doubled after optimizing the memory layout and manually unrolling the inner loops (see the series "Fortran 77 tuned" in Figure 5), but still wasn't as good as Blitz++. Examination of the assembly code revealed that the Fortran compiler was suffering from "register spillage" -- it was unable to fit all the data in registers, so their contents had to be spilled out onto the stack. This is a compiler problem; with a better code generator, the Fortran version ought to be as fast as the initial Blitz++ version.

Figure 5: Lattice QCD Benchmark

To tune the Blitz++ version, I took a step that is second nature to C++ programmers. The related objects were encapsulated in a single class:

class latticeUnit {
    // ...
protected:
    spinor one;
    gaugeElement ge;
    spinor two;
};

The revised qcdCalc() routine operated on a single vector of latticeUnit objects, rather than three separate vectors. This encapsulation ensured that related objects were stored nearby in memory, a property known as locality of reference. Cache designs assume programs will exhibit locality of reference, so encapsulation tends to improve performance. In the QCD benchmark, encapsulation boosted performance by 15-20% (see the "Blitz++ tuned" series in Figure 5).

4: Conclusion

After nearly a decade of being dismissed as too slow for scientific computing, C++ has caught up with Fortran and is giving it stiff competition. The combination of new library design techniques (expression templates, template metaprograms) and better compilers means that programmers can enjoy the expressiveness and ease of C++ without paying a performance price.

5: For more information

For references and more information about the techniques described in this article, see the Blitz++ web page at http://oonumerics.org/blitz/