# Another View on the Classic Ray-AABB Intersection Algorithm for BVH Traversal

#### Motivation

Rendering, physics, and AI applications benefit from fast ray-box intersection tests. For me, ray-axis-aligned-bounding-box (AABB) intersection tests are of particular interest, because Bounding Volume Hierarchies (BVHs) are commonly built with AABBs, and BVHs, in turn, are the most common acceleration structures for ray tracing. To traverse BVHs efficiently, a fast algorithm to judge whether a ray pierces an AABB is essential: it decides whether a node's primitives or child nodes need to be tested (the ray hits the parent AABB) or can be discarded (it does not).

#### Most common algorithm

There are a few algorithms that solve the ray-AABB intersection problem. Majercik et al. give an excellent overview of existing ray-box tests. Which algorithm to choose amongst these usually depends on a few factors:

• Hardware (CPU or GPU? Which architecture?)
• Context (Where is the algorithm called, and how many times? Which output values do we expect, e.g. a boolean, parametric min and max values, a hit normal?)
• Performance and precomputation
• Ease of understanding and implementation

To my knowledge, the simplest but also most competitive ones are variants of the slabs test originating from Kay and Kajiya. Andrew Kensler proposed a simple and fast variant featured on Peter Shirley’s blog, which is also referenced in the Real-Time Rendering intersection table. A very similar variant is also given by Pharr et al.

There are many sources explaining the concept behind the slabs algorithm [Kay and Kajiya | Pharr et al. | Scratchapixel | Broman], so I’ll skip the explanation at this point.

#### My story — and an idea

As part of my master’s thesis in fall 2016, I worked on a ray- and cone-tracing framework in Unity3D. I implemented the creation of offline acceleration structures on the CPU and their traversal on the GPU. Our goal was a modular framework that enables exchanging different modules in the ray/cone-tracing pipeline (e.g. variants of triangle intersectors or acceleration structures).

In this context, I also integrated Andrew Kensler’s ray-AABB intersection variant from the previous section. But when I learned about SIMD/SSE and related concepts, I kept wondering whether this algorithm could be optimized further. At the same time, I had not heard of anyone doing this yet, even though I was certain it would be possible, somehow.

I was eager to take a closer look, but unfortunately this was not the goal of my thesis. So at the time, the idea fell off the table and just lingered in the back of my mind.

Fast forward to fall 2018. With the recent emergence of RTX and DirectX Raytracing, I am very happy that the “holy grail” of computer graphics is finally coming to consumers. It is very impressive what NVIDIA pulled off here. With legends in the field like Ingo Wald and Matt Pharr joining their existing team of world-class experts, I am sure much more exciting news will come from them in the near future.

With all these ray tracing developments coming at me, I also looked back on my time researching in this area. And then I remembered this idea of improving Kensler’s variant of the slabs algorithm. So I just had to sit down and get closure on this topic: Is it possible to extend it with vector intrinsics, and if so, why hadn’t I heard from anyone about this possibility yet?

#### Applying my ideas to the classic Ray-AABB algorithm

I will try to frame my approach following these guidelines:

1. Use intrinsic GPU functions where possible: They are likely to be optimized and provide a faster solution than a manual implementation (I hope!).
2. Prepare for the worst case: Assume that our algorithm does not get an early out.
3. Take advantage of vectorization where possible.

We’ll implement the algorithm on the GPU in HLSL. This way, we can utilize it in Compute Shaders and Unity3D.

Let’s start.

We begin with Andrew Kensler’s variant, just converted into HLSL:
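For reference, here is a C++ sketch of this slabs variant (the `Vec3`/`Ray`/`AABB` structs are minimal stand-ins for the HLSL types; apart from the swap, the HLSL version reads nearly identically):

```cpp
#include <algorithm>

struct Vec3 { float v[3]; };
struct Ray  { Vec3 origin, dir; };
struct AABB { Vec3 min, max; };

// Kensler's slabs variant: shrink the [tmin, tmax] interval one axis at a time.
bool hitKensler(const AABB& box, const Ray& ray, float tmin, float tmax) {
    for (int a = 0; a < 3; ++a) {
        float invD = 1.0f / ray.dir.v[a];                    // reciprocal of the ray direction
        float t0 = (box.min.v[a] - ray.origin.v[a]) * invD;  // parametric distance to one slab plane
        float t1 = (box.max.v[a] - ray.origin.v[a]) * invD;  // ... and to the opposite one
        if (invD < 0.0f) std::swap(t0, t1);                  // ensure t0 <= t1 on this axis
        tmin = t0 > tmin ? t0 : tmin;                        // ternary selects, as in the original
        tmax = t1 < tmax ? t1 : tmax;
        if (tmax <= tmin) return false;                      // early out: interval became empty
    }
    return true;                                             // all three slab intervals overlap
}
```

The interval `[tmin, tmax]` starts as the ray’s valid range and is clipped by each axis’ slab; the ray hits the box exactly when the final interval is non-empty.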

Let’s grab the low-hanging fruits first.

Early on, the reciprocal of the ray direction is computed with a manual division. In HLSL, we can replace it with the intrinsic function call `rcp` (please see here for a full list of intrinsic functions in HLSL). Also, ternary operators are used to assign `tmin` the max value of `(t0, tmin)` and `tmax` the min value of `(t1, tmax)`. Consequently, we can replace them with `min`/`max` function calls.
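In a C++ sketch (same stand-in structs as before), those two substitutions look like this; the comment marks where HLSL would use the `rcp` intrinsic instead of a division:

```cpp
#include <algorithm>

struct Vec3 { float v[3]; };
struct Ray  { Vec3 origin, dir; };
struct AABB { Vec3 min, max; };

// Slabs test after the low-hanging fruit: min/max calls replace the ternaries.
bool hitMinMax(const AABB& box, const Ray& ray, float tmin, float tmax) {
    for (int a = 0; a < 3; ++a) {
        float invD = 1.0f / ray.dir.v[a];                    // HLSL: rcp(ray.dir[a])
        float t0 = (box.min.v[a] - ray.origin.v[a]) * invD;
        float t1 = (box.max.v[a] - ray.origin.v[a]) * invD;
        if (invD < 0.0f) std::swap(t0, t1);
        tmin = std::max(t0, tmin);                           // was: t0 > tmin ? t0 : tmin
        tmax = std::min(t1, tmax);                           // was: t1 < tmax ? t1 : tmax
        if (tmax <= tmin) return false;
    }
    return true;
}
```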

Cool.

Now, what was always bothering me is the for-loop in this algorithm. In the worst case, meaning that a ray really does intersect the AABB, the algorithm goes through the loop three times (once per axis) and never gets an early out. Consequently, all branch operations and computations are executed three times. This may not sound like much, but consider that this is computed for multiple AABBs on multiple levels of multiple BVHs for millions of rays per frame: phew.

When examining the algorithm closely, we see that even in the best case the loop is executed at least once. The question arises: if we compute the reciprocal of the ray direction for one axis anyway in the best case, why not compute all three axes at once through vectorization, so that we are also well prepared for the worst case? The same applies to the next two lines of code, in which we can compute the parametric distances `t0` and `t1` to `aabb.min` and `aabb.max` for all three axes at once, because there is no sequential computational dependency between these values. In addition, we can do the `invD < 0.0f` check in a vectorized format as well.

Let’s see where we’re getting with this:
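Sketched in C++ (the componentwise helper functions stand in for HLSL’s `float3` arithmetic and `rcp`), the hoisted version might look like this:

```cpp
#include <algorithm>

struct Vec3 { float v[3]; };
struct Ray  { Vec3 origin, dir; };
struct AABB { Vec3 min, max; };

static Vec3 vrcp(Vec3 a)         { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = 1.0f / a.v[i];    return r; }
static Vec3 vsub(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = a.v[i] - b.v[i];  return r; }
static Vec3 vmul(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = a.v[i] * b.v[i];  return r; }

bool hitVectorized(const AABB& box, const Ray& ray, float tmin, float tmax) {
    // Hoisted out of the loop: all three axes at once (HLSL: rcp and float3 arithmetic).
    Vec3 invD = vrcp(ray.dir);
    Vec3 t0 = vmul(vsub(box.min, ray.origin), invD);
    Vec3 t1 = vmul(vsub(box.max, ray.origin), invD);
    bool neg[3];                                     // vectorized sign check (HLSL: a bool3)
    for (int i = 0; i < 3; ++i) neg[i] = invD.v[i] < 0.0f;

    for (int a = 0; a < 3; ++a) {
        if (neg[a]) std::swap(t0.v[a], t1.v[a]);     // still swapping inside the loop
        tmin = std::max(t0.v[a], tmin);
        tmax = std::min(t1.v[a], tmax);
        if (tmax <= tmin) return false;              // early out still in place
    }
    return true;
}
```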

I love that we’re moving things out of the loop here. But there is still too much left we also need to get rid of.

It is already good that we just check a boolean value in the if-clause instead of evaluating the `invD` condition three times. But we can go further. We test the sign of the direction (or inverse direction; it makes no difference here) because we are interested in whether `t0` or `t1` is bigger, and then swap these values accordingly (so that `t1 > t0`). Well: we could also use the `min`/`max` functions again to ensure this! And when applying them to our `float3` vectors, we avoid branching and compute the min/max for all axes in one line of code each. We can then also delete the `invD` check completely.
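Continuing the C++ sketch (same stand-in helpers; `vmin`/`vmax` mirror HLSL’s componentwise `min`/`max` on `float3`), the sign check and swap collapse into two branch-free lines:

```cpp
#include <algorithm>

struct Vec3 { float v[3]; };
struct Ray  { Vec3 origin, dir; };
struct AABB { Vec3 min, max; };

static Vec3 vrcp(Vec3 a)         { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = 1.0f / a.v[i];              return r; }
static Vec3 vsub(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = a.v[i] - b.v[i];            return r; }
static Vec3 vmul(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = a.v[i] * b.v[i];            return r; }
static Vec3 vmin(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = std::min(a.v[i], b.v[i]);   return r; }
static Vec3 vmax(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = std::max(a.v[i], b.v[i]);   return r; }

bool hitNoSwap(const AABB& box, const Ray& ray, float tmin, float tmax) {
    Vec3 invD = vrcp(ray.dir);
    Vec3 t0 = vmul(vsub(box.min, ray.origin), invD);
    Vec3 t1 = vmul(vsub(box.max, ray.origin), invD);
    Vec3 tsmaller = vmin(t0, t1);   // per-axis slab entry: no sign check, no swap
    Vec3 tbigger  = vmax(t0, t1);   // per-axis slab exit

    for (int a = 0; a < 3; ++a) {
        tmin = std::max(tsmaller.v[a], tmin);
        tmax = std::min(tbigger.v[a], tmax);
        if (tmax <= tmin) return false;   // only the early out remains in the loop
    }
    return true;
}
```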

After that, the for-loop only consists of selecting the min/max components of two vectors into `tmin`/`tmax`, plus a possible early out. Since we want to optimize for the worst case, we assume that the early out actually does more harm than good. Remember: in the worst case, we pay for it three times!

So we’re just left with the min/max component selection, which we can easily unroll, and, boom: the complete loop is gone. Let’s hope it was worth the effort.

Finally, we return the value of `tmin < tmax`.

Then, the algorithm transforms into:
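A C++ sketch of this final form (the componentwise helpers again stand in for HLSL `float3` intrinsics; an HLSL version would use `rcp`, `min`, and `max` directly):

```cpp
#include <algorithm>

struct Vec3 { float v[3]; };
struct Ray  { Vec3 origin, dir; };
struct AABB { Vec3 min, max; };

static Vec3 vrcp(Vec3 a)         { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = 1.0f / a.v[i];              return r; }
static Vec3 vsub(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = a.v[i] - b.v[i];            return r; }
static Vec3 vmul(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = a.v[i] * b.v[i];            return r; }
static Vec3 vmin(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = std::min(a.v[i], b.v[i]);   return r; }
static Vec3 vmax(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = std::max(a.v[i], b.v[i]);   return r; }
static float maxComponent(Vec3 a) { return std::max(a.v[0], std::max(a.v[1], a.v[2])); }
static float minComponent(Vec3 a) { return std::min(a.v[0], std::min(a.v[1], a.v[2])); }

// Final form: fully vectorized slabs test — no loop, no branches, no early out.
bool hitFinal(const AABB& box, const Ray& ray, float tmin, float tmax) {
    Vec3 invD = vrcp(ray.dir);
    Vec3 t0 = vmul(vsub(box.min, ray.origin), invD);
    Vec3 t1 = vmul(vsub(box.max, ray.origin), invD);
    Vec3 tsmaller = vmin(t0, t1);
    Vec3 tbigger  = vmax(t0, t1);
    tmin = std::max(tmin, maxComponent(tsmaller));   // unrolled min/max reduction
    tmax = std::min(tmax, minComponent(tbigger));
    return tmin < tmax;
}
```

One caveat worth knowing: if a direction component is zero and the ray origin lies exactly on a slab plane, `0 * inf` produces a NaN, and the result then depends on how the min/max implementation treats NaNs.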

#### Results

Let’s find out if all the changes were worth going through. In the following, I’ll compare the performance of Andrew Kensler’s variant with the final transformation of the previous section. For this purpose, I employed both in a Unity3D ray tracing framework with BVH traversal. In this test, we’ll just focus on the traversal. So we’ll only shoot primary rays from the camera, traverse them through the BVH, and sum the number of traversal steps / number of intersections with triangles (without computing them). This will provide us with a nice heatmap visualization, and we can profile rendering times for one frame. We will conduct these tests with some traditional test meshes from Morgan McGuire’s Computer Graphics Archive (thanks for providing them!).

Let’s have a look at the resulting data:

In these tests, the new variant provides a speedup of ~1.33× on average. Awesome! For example, that is the difference between 30 fps and 40 fps, or between 50 fps and 66 fps.

#### What now?

After seeing the results, I was thrilled to share them with other people and discuss them. But before sending them out, I wanted to make sure one more time that I had not missed anything, especially because I had not been actively doing research since my thesis.

So I checked again and, to my surprise, found a freshly published JCGT paper from just two days earlier: “A Ray-Box Intersection Algorithm and Efficient Dynamic Voxel Rendering” by Majercik, Crassin, Shirley, and McGuire (NVIDIA). They present a novel intersection algorithm for arbitrarily oriented boxes and showcase it by rendering a huge amount of dynamic voxels via voxel ray tracing.

But more interestingly for me, they also wrote about an “efficient slab test”, which is apparently used in implementations of NVIDIA OptiX and Intel Embree.

In their listing, they put it as:
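Paraphrased here in C++ rather than reproduced verbatim (their listing is written in shader-style code; `vsub`/`vmul`/`vmin`/`vmax` stand in for componentwise vector operations, and the inverse ray direction is precomputed once per ray):

```cpp
#include <algorithm>

struct Vec3 { float v[3]; };

static Vec3 vsub(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = a.v[i] - b.v[i];          return r; }
static Vec3 vmul(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = a.v[i] * b.v[i];          return r; }
static Vec3 vmin(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = std::min(a.v[i], b.v[i]); return r; }
static Vec3 vmax(Vec3 a, Vec3 b) { Vec3 r; for (int i = 0; i < 3; ++i) r.v[i] = std::max(a.v[i], b.v[i]); return r; }
static float maxComponent(Vec3 a) { return std::max(a.v[0], std::max(a.v[1], a.v[2])); }
static float minComponent(Vec3 a) { return std::min(a.v[0], std::min(a.v[1], a.v[2])); }

// Efficient slab test (paraphrased): p0/p1 are the box corners,
// invRayDir is the precomputed reciprocal of the ray direction.
bool slabs(Vec3 p0, Vec3 p1, Vec3 rayOrigin, Vec3 invRayDir) {
    Vec3 t0 = vmul(vsub(p0, rayOrigin), invRayDir);
    Vec3 t1 = vmul(vsub(p1, rayOrigin), invRayDir);
    Vec3 tmin = vmin(t0, t1);
    Vec3 tmax = vmax(t0, t1);
    return maxComponent(tmin) <= minComponent(tmax);
}
```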

I was stunned that this was basically the same as the result of the process I had just gone through. In their findings, they show that this is currently the most efficient algorithm known for a simple ray-AABB intersection test. They add that “to their knowledge, this important efficient slabs method used by industry ray tracers is not widely known.”

I have to say: I am very happy that I came to the same conclusion as the authors, and my journey without this knowledge was very exciting. In my tests, the efficient slabs test outperformed the classic variant as well. And I agree that this variant is currently not very well known.

So I guess my goal now is to make this form of the implementation better known among developers. I hope this article helps in doing so!

Cheers
Broman