## This post is about optimizing the AABB tree that we're using as our main acceleration structure.

I will use the polly scene from previous post for the tests, but have increased the output resolution to get results a bit more meaningful.

Initial performance:

Scene: Project Polly

Resolution: fullHD 1920x1080

Primary rays per pixel: 64

Results: 200 s, ~660k Rays/s

Taking a first look at the profiler, there's an obvious fail:

We're spending 24% of run time recreating the implicit representation of the rays again and again for each tree node.

Instead of that, we can compute it once at the top of the tree, and pass it down to the nodes (this commit).

New results: 162s, ~820k Rays/s

24% faster and implicit() no longer shows on the profiler. Great :)

### Simple SIMD

The main bottleneck is clearly AABB-ray intersection, taking almost 50% of the time. This is the code for it:

bool intersect(const Ray::Implicit& _r, float _tmin, float _tmax, float& _tout) const { Vector t1 = mMin *_r.n + _r.o; Vector t2 = mMax *_r.n + _r.o; _tmin = std::max(_tmin, std::min(t1.x(), t2.x())); _tmax = std::min(_tmax, std::max(t1.x(), t2.x())); _tmin = std::max(_tmin, std::min(t1.y(), t2.y())); _tmax = std::min(_tmax, std::max(t1.y(), t2.y())); _tout = std::max(_tmin, std::min(t1.z(), t2.z())); _tmax = std::min(_tmax, std::max(t1.z(), t2.z())); return _tmax >= _tout; }

This is the usual algorithm of finding the intersection interval (I think it's called the 'slab' method).

We're doing the same min and max operations on all the components, so it looks like this code can be vectorized using SIMD. Here's the vectorized version (using a wrapper class for intrinsics)

bool intersect(const Ray::ImplicitSimd& _r, float _t0, float _t1, float& _tCollide) const { Vector t1 = (mMin - _r.o) * _r.n; Vector t2 = (mMax - _r.o) * _r.n; // Swapping the order of comparison is important because of NaN behavior and SSE auto tEnter = math::min(t2,t1); // Enters _tCollide = std::max(_t0, tEnter.hMax()); // Furthest enter auto tExit = math::max(t1,t2); // Exits float closestExit = std::min(_t1, tExit.hMin()); return closestExit >= _tCollide; }

Performance is: 110s, 1.2 million rays/s

That's another 32% faster than the non SIMD version.

While that code seems to do a good job in general, it doesn't handle the edge cases properly (neither version does).

### Fixing Correctness

Let's examine first the scalar version of the code, fix it, and then see how to turn that into proper vector code. I also wrote some tests to verify the correct handling of edge cases. Unsurprisingly, the code above didn't pass them.

The first problem comes from the way I represent the impicit ray. _r.n contains the inverse direction of the ray ( {1/dir.x, 1/dir.y, 1/dir.z} ) and _r.o contains minus the origin of the ray, premultiplied by that inverse. When some component of the direction is 0 (i.e. for rays parallel to one of the main planes), both _r.n and _r.o become infinity, sometimes with opposite signs, and adding opposite infinities gives NaN. NaN is not always the end of the world, but in our case, we want to intersect parallel planes at infinity, not at NaN. Storing the ray's unaltered origin at _r.o, and reordering the operations, solves this problem

Vector t1 = (mMin - _r.o) * _r.n; Vector t2 = (mMax - _r.o) * _r.n;

For any finite bounding box, mMin and mMax can not contain infinities, so no matter where the origin of the ray is, this subtraction will not result in NaN. It can be +- infinity, but a normalized direction's inverse can never contain 0. The inverse case is still possible, though. When the ray's origin coincides with one of the boundaries of the box, and the ray's direction is parallel to that plane, we have 0*infinity = NaN. If the box has 0 size in the direction perpendicular to the plane, this NaN will propagate to both t1 and t2. Otherwise it will only happen in one of them at most for each coordinate. What we need now, is a way to contain that NaN, when it happens, so it never reaches the end result.

First, let's group all those max, min operations into a 'vector' operation to make them clearer. I'm using vector here in the sense that we're using the vector classes in the math library to improve readability, not in the SIMD vector sense.

auto tEnter = math::min(t1,t2); // Enters auto tExit = math::max(t1,t2); // Exits

_tout = std::max(_tmin, std::max(tEnter.x(), std::max(tEnter.y(), tEnter.z())));

auto minExit = std::min(_tmax, std::min(tExit.x(), std::min(tExit.y(),tExit.z())));

This code is almost good. A NaN from t1 or t2 can make it into tEnter and tExit. Since comparing NaN with any number will return false, both std::min and std::max will return the first argument, meaning that when NaN is that first argument, we'll be destroying information from the other argument (which can be valid information). Moreover, if math::max and math::min of our vector class are internally based on std::max and min, NaNs in t1 will make it into both tEnter and tLeave, destroying more information. However, if we're clever in reordering our calls, we can use this property to our advantage.

auto tEnter = math::min(t1,t2); // Enters auto tExit = math::max(t2,t1); // Exits

_tout = std::max(std::max(std::max(_tmin, tEnter.x()), tEnter.y()), tEnter.z());

auto minLeave = std::min(std::min(std::min(_tmax, tLeave.x()), tLeave.y()),tLeave.z());

Since _tmin and _tmax can not be NaN in a well formed call to this function, the code above will preserve valid information and destroy the NaNs. Also, since we swaped the order of t1 and t2 in the second line, NaNs from those will only make it into one of the boundary vectors (except in the case of a degenerate AABB with no volume, where there's no information to preserve anyway). With this new version, rays contained withing one of the sides of the AABB will behave as if intersecting the volume at the same time than some other coordinate, and since a unit vector can't be parallel to 3 perpendicular planes, there will always be at least one dimension in which this gives a valid number. Also, +-infinities get gracefully resolved by the max and mins as +infinities will be part of exits and -infinities will be part of entries, making them irrelevant extremes (e.g. any finite exit will be smaller than infinity).

That code works, but coherency can be slightly improved. The standard behavior of max and min, of returning the first operand on NaN is exactly the opposite to what SIMD does, so to keep code consistent, I've implemented the math::max and math::min methods returning the second operand instead. With that, the reordering of comparisons is as follows:

Vector t1 = (mMin -_r.o)*_r.n; Vector t2 = (mMax -_r.o)*_r.n; // Swapping the order of comparison is important because of NaN behavior auto tEnter = math::min(t1,t2); auto tLeave = math::max(t2,t1); _out = math::max(tEnter.x(), math::max(tEnter.y(), math::max(tEnter.z(), _tmin))); auto minLeave = math::min(tLeave.x(), math::min(tLeave.y(), math::min(tLeave.z(), _tmax))); return minLeave >= _maxEnter;

### Correct SIMD

That code is almost trivial to vectorize. The only problem is that all those nested comparisons don't seem very simd-friendly. However, one can notice that the operation must be symmetric on the components, meaning that it should treat x, y, and z equally, and the order of comparisons doesn't matter betweem them (still matters with tmin and tmax!). As such, we can compare the 3 components at once against quadruplicated tmin and tmax, and eliminate all NaNs at once. All that's left is an horizontal comparison of the three components, et voilĂ :

Vector t1 = (mMin - _r.o) * _r.n; Vector t2 = (mMax - _r.o) * _r.n; // Swapping the order of comparison is important because of NaN behavior auto tEnter = math::min(t2,t1); // Enters auto tExit = math::max(t1,t2); // Exits auto tMin = float4(_t0); auto tMax = float4(_t1); auto maxEnter = math::max(tEnter,tMin); // If nan, return second operand, which is never nan auto minLeave = math::min(tExit,tMax); // If nan, return second operand, which is never nan _tCollide = maxEnter.hMax(); // Furthest enter return minLeave.hMin() >= _tCollide;

Now, how does this code perform? Our Polly test scene, in the same conditions as before, takes about 102 seconds on average, making 1.3 million primary rays per second. Runs about 8% faster, and produces more correct results. Why faster? I'm guessing that not leaving the SIMD code until the very end helps here, but I haven't inspected the generated assembly. Overall, by using SIMD code, AABB intersection tests have gone from ~78 seconds (48% of 162 seconds) to ~31 seconds, which is 60% faster. Since we're only using 3 of the four components of the SIMD vectors, our theoretical limit was 66% improvement, and this is quite close.

### Next time

We can still try to do better by performing many tests at once, instead of vectorizing the internals of the test. We can also play aroung with memory layouts, or improve the distribution of our tree, so we hopefully do less tests in the first place. Another possibility is going back to adding features to the raytracer. Textures would be a nice one.

## No comments:

## Post a Comment