NOTE: This is not a blog post and it will be improved over time as I learn more about the subject. Feel free to ask questions or suggest additions!
The problem I am looking at today is how to deal with the inconvenient problem of reduced performance when shooting incoherent rays. It comes in a few different variants, but in this post I will try to only discuss the situation on the GPU. Note that I will not discuss this in the context of DXR (or RTX) since I don't know how it works on the GPU. In particular, I don't know how to submit a batch of rays that you know are coherent in order to “help” the GPU, or whether that is already handled, or all for nothing since the rays will be randomized anyway. I will also not go into the specifics of why incoherency is bad, but mostly note when it happens and how it can (sometimes) be fixed.
We will limit our discussion to 1 sample per pixel (1 SPP) per frame just so that the concept of a frame is meaningful. I am not sure if this is a reasonable budget or not, but it makes it easier to compare different scenarios. Note that when doing multi-bounce a sample is a “path” that consists of multiple rays (depending on how many bounces the path experiences). Outdoors the number of bounces is generally low and indoors the number of bounces is generally high.
For now I think focusing on path-tracing for realtime is probably the wrong thing to do for most games. Fixing soft/textured shadows and having proper reflections/AO seems good enough. With denoising and more advances it can probably be done, especially once we have even faster GPUs and more knowledge, so feel free to research away on it!
There are a few different ways to render. Depending on the one chosen there is more or less time for an image to converge.
In 2-5 we have more than one frame to converge over. In 3-4 we must get something nice quickly and then improve if the user doesn't change anything. In 1-3 the user expects that animated meshes work.
I expect that at least 1-2 will use hybrid rendering so that the first hit (camera to first surface) is done using rasterization. TAA will handle anti-aliasing instead of how path-tracing usually deals with it. An exception might be for foveated rendering and adaptive rendering that might be better handled using raytracing.
For lightmap baking some of these modes change (an example is that camera navigation doesn't invalidate old results for progressive rendering).
We will start with an example with ONE secondary ray since it is already hard enough.
Let's say we have a fully diffuse scene and we want to render it progressively.
First we fire rays from the camera into the scene. If we are looking at the sky some rays might be terminated. If we are looking at the ground all will survive. The hit points are mostly coherent. By this I mean that adjacent pixels will fire similar rays that will on average end up on surfaces that are close to each other with similar normals. Here it might be necessary to compact the set of paths so that we don't end up with idle threads in wavefronts/warps.
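To make the compaction step concrete, here is a minimal CPU-side sketch in Python. It mimics the usual GPU pattern of a prefix sum followed by a scatter; the function name `compact_paths` and the list-based layout are my own assumptions for illustration, not how any particular API does it.

```python
from itertools import accumulate

def compact_paths(paths, alive):
    """Pack the surviving paths contiguously, preserving their order.

    paths: list of per-path payloads (hypothetical, anything works).
    alive: list of booleans, True if the path should continue.
    """
    flags = [1 if a else 0 for a in alive]
    # Inclusive prefix sum: for a survivor at index i, inclusive[i] - 1 is its output slot.
    inclusive = list(accumulate(flags))
    count = inclusive[-1] if inclusive else 0
    out = [None] * count
    for i, path in enumerate(paths):
        if flags[i]:
            # On a GPU every thread could do this scatter independently.
            out[inclusive[i] - 1] = path
    return out
```

After this step the live paths sit next to each other, so the next pass can launch only as many threads as there are survivors.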
For progressive rendering we want the image to look good for 5 samples if you take 5 samples, and as you approach, say, 10 samples it should look good for 10 samples. That way it will converge quickly in the beginning and then keep improving as we wait (unless we restart). For this to work we need random numbers to spawn directions over the hemisphere that have the progressive property that the samples \([0,..,4]\) give good directions, but so should \([0,..,N]\) for any \(N\). It is quite common to tabulate these for, say, \(N=1024\) and after that just use random white noise instead, since at that point there is little benefit to using, say, a blue-noise distribution.
Now we are at the surface that is visible from the camera and we want to choose a secondary direction (guided by the diffuse BRDF). The issue then is that if adjacent pixels use the same random directions there will be severe banding artifacts. This is because the errors of the integrals are very correlated between the pixels. As the number of samples in the sequence goes up this correlation will mostly disappear, but if we have a reasonably low number of samples it will look really bad. It usually manifests itself as big constant areas.
Think of it like this. If the camera sees a big plane, all the diffuse sampling will be started from that plane. If all pixels use the same random sequence they will all send their first ray in the same direction (but from slightly different points). Now let's imagine that there is an area light shining on the floor. Since all rays go in the same direction, the result from the first frame (1 SPP) will look like a rectangle on the ground. The next frame will add another such rectangle, and the same for every frame. This will give areas with constant color, which is not nice. If we take an infinite number of samples it will look OK in the end, but we don't have time for that today. Instead we want to trade banding for noise. Noise is something the user can accept (or it can be removed by a denoiser). Banding is not so easy to dispense with.
An easy way to break up the banding is to make sure all pixels use a different random sequence. Let's say that we generate 16×16 unique random sequences. We then tile them over the frame so that after 16 pixels we reuse a random sequence. After 16 pixels the rays that go in the same direction will probably see different things, so it can still look good. Or we try 32×32. A good value depends on content and resolution. The random angle should probably be based on blue noise or something similar so that pixels are different but not too different.
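As a minimal sketch of a sequence with this progressive property, here is a Halton sequence (bases 2 and 3) mapped to cosine-weighted directions around a +z normal. Halton is my stand-in here, not necessarily what you would tabulate in practice, and the function names are made up for illustration.

```python
import math

def radical_inverse(index, base):
    """Mirror the digits of `index` in `base` around the radix point."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * f
        index //= base
        f /= base
    return result

def cosine_hemisphere(u, v):
    """Map (u, v) in [0,1)^2 to a cosine-weighted unit direction, z is the normal."""
    r = math.sqrt(u)
    phi = 2.0 * math.pi * v
    return (r * math.cos(phi), r * math.sin(phi), math.sqrt(max(0.0, 1.0 - u)))

def progressive_directions(n):
    """First n directions; any prefix of the list is itself a usable sample set."""
    return [cosine_hemisphere(radical_inverse(i, 2), radical_inverse(i, 3))
            for i in range(n)]
```

The thing to notice is that `progressive_directions(5)` is exactly the first five entries of `progressive_directions(16)`, so stopping early never wastes samples.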
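The tiling itself is just index arithmetic. Below is a small sketch; the white-noise sequences from Python's `random` module are only a placeholder (something blue-noise-like would be the better choice, as noted above), and all names are hypothetical.

```python
import random

TILE = 16  # assumed tile size; 32 works the same way

def sequence_id(x, y, tile=TILE):
    """Which of the tile*tile sequences the pixel (x, y) uses."""
    return (y % tile) * tile + (x % tile)

def make_sequences(tile=TILE, length=8, seed=1234):
    """One sequence of 2D sample points per pixel of the tile (placeholder: white noise)."""
    rng = random.Random(seed)
    return [[(rng.random(), rng.random()) for _ in range(length)]
            for _ in range(tile * tile)]

def sample_for_pixel(sequences, x, y, sample_index, tile=TILE):
    """Look up the 2D sample that pixel (x, y) uses for a given frame/sample index."""
    return sequences[sequence_id(x, y, tile)][sample_index]
```

Pixels (3, 5) and (19, 21) end up with the same sequence because they are exactly one tile apart in both directions, while neighbours inside a tile always get different ones.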
Problem solved? Not quite. Now adjacent pixels create rays going in wildly different directions. Incoherence.
A solution to this is introduced in the paper Interleaved Sampling. The general idea that I took from it is that while adjacent pixels must have different random sequences, pixels some way apart can share random sequences. If we reorder the processing of pixels in our frame we can make sure that we create a batch of rays that use the same random sequence but are located some distance apart.
To make this concrete, let's say that we are doing our own GPU raytracing. We use one “thread” in our warp/wave per ray. We then want to make sure that all threads in a warp correspond to pixels that are sort of close and use the same random sequence (hence going in roughly the same direction). This also means that once all the rays in our wave/warp have reached their targets and we want to fire even more rays, they are at least maybe close to each other. And if we are outside, hopefully most of them hit the environment so we don't have to process them at all.
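Here is how I picture the reordering, as a CPU-side sketch. Pixels are grouped by their position inside the tile, so each group shares one random sequence, and each group is then cut into warp-sized chunks. The names and the scanline order inside a group are my assumptions; a real implementation would probably use a more compact spatial order.

```python
def interleaved_batches(width, height, tile=4):
    """Group pixel coordinates by (x % tile, y % tile), i.e. by shared random sequence."""
    batches = {}
    for y in range(height):
        for x in range(width):
            batches.setdefault((x % tile, y % tile), []).append((x, y))
    return batches

def split_into_warps(pixels, warp_size=32):
    """Cut one batch into chunks of at most warp_size pixels, one thread per pixel."""
    return [pixels[i:i + warp_size] for i in range(0, len(pixels), warp_size)]
```

For an 8×8 image with a 4×4 tile, the batch for tile position (0, 0) is the pixels (0, 0), (4, 0), (0, 4) and (4, 4): all some distance apart, all using the same sequence.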
Note that this would probably also work well for “Realtime rendering with TAA”. The random sequence would be different but the general idea would work out.
Here we are not concerned with being able to produce a good image after each frame. This helps us find coherency. Let's say we want to do 256 samples. If we let all pixels have the same 256 random directions there will be banding instead of noise. If we do total randomization there will be noise but no coherency.
What do we really want? For a given frame we want all random directions to be roughly the same. They don't have to be exactly the same, but we want rays that are fired close to each other and end up at a surface to shoot secondary rays in roughly the same direction. Here micro-jitter is a perfect solution. It takes one random direction and perturbs it just enough to make the pixels not have correlated values. No banding but noise. The idea is that we don't rotate things wildly. Sample N goes roughly in the same direction, but each pixel moves it just a little bit differently.
After frame 0 we will have a very banded image, so it is no good for progressive rendering. But once we've gotten to 256 samples each pixel will have used different random directions. The key here is that once we've taken all 256 samples we don't remember what order we took them in. And taken as a whole, any two pixels have different random directions.
The higher the sample count, the smaller the perturbations we need.
To see why this works, consider a hemisphere with well-placed points. Now we form Voronoi regions around each point and then we move the points randomly within their Voronoi region, differently for different pixels. Then we shuffle the order of the points, differently for different pixels. Since order is forgotten (we don't care about what happens early on for low sample counts) this works out.
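A rough sketch of this argument, simplified to the unit square instead of the hemisphere: the well-placed points are the centers of a regular grid, whose Voronoi regions are simply the grid cells. This is my own toy version and not the algorithm from the paper; the per-pixel seed is an arbitrary hash I picked for illustration.

```python
import random

def base_points(n_side):
    """Centers of an n_side x n_side grid in the unit square (the shared point set)."""
    return [((i + 0.5) / n_side, (j + 0.5) / n_side)
            for j in range(n_side) for i in range(n_side)]

def pixel_samples(x, y, n_side, shuffle=True):
    """Per-pixel version of the shared points: jitter inside each cell, then reorder."""
    rng = random.Random((x * 73856093) ^ (y * 19349663))  # arbitrary per-pixel seed
    half = 0.5 / n_side  # more samples means smaller cells, so smaller perturbations
    pts = [(px + rng.uniform(-half, half), py + rng.uniform(-half, half))
           for px, py in base_points(n_side)]
    if shuffle:
        rng.shuffle(pts)  # the order differs per pixel, the set of cells does not
    return pts
```

With `shuffle=False`, sample N of every pixel stays inside the same cell, which is the coherent "roughly the same direction" case. With `shuffle=True` the same jittered points are visited in a per-pixel order, which is the situation the Voronoi argument describes.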
In the micro-jitter paper they talk about how this can help with multi-bounce as well, but I haven't wrapped my head around that yet. My thinking right now is that it is wrong, but I am probably the one being wrong :)
Here is the paper: Cache-Friendly Micro-Jittered Sampling.
Now let's say that we've shot our camera rays and our first level of coherent diffuse rays (using interleaved sampling or micro-jitter perhaps). Now we are at the second bounce positions. Hopefully many paths have been terminated (if we are outside) but some will live on. Maybe we should compact rays again. But can we create coherent rays yet again?
Not easily in my experience. Let's keep the idea of micro-jitter in mind. After the first bounce we send all rays roughly in the same direction. If we imagine 4×4 pixels all intersecting a floor and all taking the same direction away from there, they might all end up at roughly the same surface. Now to give those 4×4 pixels uncorrelated values we need them to go in different directions. We need them to be incoherent. Failing to do so usually ends up with the diffuse lobe looking more like a reflection.
The micro-jitter paper says that it works out OK anyway but I have my doubts so I need to go through it in depth. So take it with a grain of salt.
Here we can maybe afford to do multi-bounce GI depending on our budget. Since we don't need to present a good image after each frame we can let the result have banding until we converge. It is important that we compact the rays so that paths that die don't leave “dead” threads in waves/warps. This is probably handled implicitly behind the scenes in DXR. If you are doing your own GPU implementation then there doesn't have to be explicit compaction. It can be handled by path restarting or persistent threads and some sort of ray queue. The important part is to not leave threads idle.
An important factor here is that the first pass from the camera to the scene is often very coherent, so you don't want to mix too many secondary rays into waves/warps that are doing such rays. This is an argument against too naive ray-restarting.
In this mode we will have a massive amount of rays that we can process before we need to show anything. Latency is OK. Thus we have a situation where we can bin/sort rays and defer them until we have full coherent “buckets”. A sobering read here is this paper but I am not sure if it applies or not in this new GPU world.
The least we can do is bin on direction (positive x, negative x, ...) and maybe starting position.
An interesting read from the CPU-world is Faster Incoherent Rays: Multi-BVH Ray Stream Tracing to show you the type of ray sorting that could be done. What works depends very much on the rays and the machine you are running on. In the world of DXR, BVH traversal can probably not be manipulated, but it could implement something like this on the inside.
Before I leave I want to point out this really nice paper. It highlights what can happen if you have really expensive shading. Megakernels Considered Harmful: Wavefront Path Tracing on GPUs. It will be interesting to see if the key takeaways here change now that we have callable shaders in DXR.
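One way to read "positive x, negative x, ..." is six bins keyed on the dominant axis of the ray direction and its sign. A minimal sketch of that reading follows; the bin numbering and function names are my own, and binning on starting position is left out.

```python
def direction_bin(d):
    """Bin index 0..5 for +x, -x, +y, -y, +z, -z, chosen by the largest |component|."""
    axis = max(range(3), key=lambda i: abs(d[i]))
    return 2 * axis + (1 if d[axis] < 0 else 0)

def bin_rays(rays):
    """Distribute (origin, direction) pairs into six buckets by direction_bin."""
    buckets = [[] for _ in range(6)]
    for origin, direction in rays:
        buckets[direction_bin(direction)].append((origin, direction))
    return buckets
```

Each bucket can then be deferred until it holds enough rays to fill whole waves/warps, which is where the latency tolerance of this mode pays off.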
Thanks to Alan Wolfe for proof-reading and ideas.