Hello! I’m Cristiano Ferreira, one of the Developer Relations Engineers at Oculus. One of my team’s primary responsibilities is to help game developers optimize their games to run as efficiently as possible on all of our hardware offerings. In my experience getting Oculus Rift ports to run at perf on Oculus Quest, i’ve found quite a few easily identifiable rendering patterns that could help point towards opportunities for optimization on both the render thread and GPU. The following will show you how to identify issues and how to alleviate them. If you haven’t checked out the previous articles in the series, see the links below:
Forward rendering is rather straightforward... ha! Sorry. Forward rendering simply draws the mesh that you dispatch via draw call and renders directly to the frame buffer. Lighting is calculated in the fragment and/or vertex shaders and applied to the model in the draw call it is issued. The negative effect of this method is that dynamic / important lights may require additional passes. GPU cost is determined by the amount of geometry per-model, the amount of required passes and the lighting model complexity where in deferred rendering, lighting is calculated in screen space.
Deferred Rendering basically means that you defer lighting calculations until after you complete your base pass. The base pass in deferred rendering is different than the base pass in forward rendering in that you are writing to multiple render targets called a Gbuffer (geometric buffer). The Gbuffer contains multiple render targets that define albedo (color), normals, roughness values, etc., really anything you need to calculate lighting without the need to re-run the model through the graphics pipeline. The next phase after writing all of these values is to do a lighting pass in screen space using the previously generated g-buffer. The benefit to this method is that lighting calculations are done in screen space which decouples geometric complexity from lighting. The issue with deferred rendering is that in the lighting pass phase you need to resolve however many render targets are in your GBuffer which can eat up nearly half of your frame budget alone. This doesn’t even include the lighting pass which is fairly expensive since you need to calculate lighting for each pixel on the very high-resolution Oculus Quest screen. Forward rendering is the move!
The above RenderDoc screenshot shows a capture of a game using the deferred rendering method. The arrows show each of the render targets that are all drawn to for each individual draw call. These render targets are then used as input textures for the lighting calculation. For comparison, see the RenderDoc capture of a game using the forward rendering method below:
References: Forward + Deferred Rendering
The generated shader bytecode for each draw call can be viewed within RenderDoc to see exactly what your shader compiles into. The best way to optimize is to look at draw calls that take a relatively high amount of time to complete if geometry count is low and there isn’t a crazy amount of pixel coverage. The more you use RenderDoc the more you start to get a sixth sense for where to look. Developers I work with have found that between 25-50 instructions is generally a sweet spot for shaders to achieve desired look while keeping performance optimal, but you really need to consider the amount of complex math instructions, texture lookups and potential pixel coverage. In general, it’s recommended to make a limited set of uber shaders to maximize batching and minimize setpass / pipeline changes. A side benefit is that you only have to worry about the generated code for a small set of shaders at a time. When using a limited set of uber shaders, performance enhancements of only those few shaders propagates to all meshes in those logical groups. For instance, if your game takes place in a city, you can have an environment shader for buildings, concrete sidewalks, roofs, etc. and another for skinned characters. This will also allow you to easily merge meshes and create mesh LODs (level of detail) / HLODs (hierarchical level of detail) without worrying about the pipeline state breaking them.
Hot Tip: All objects may use different input textures but you can guarantee minimizing setpasses and maximizing batches using texture arrays and a lookup ID (or by creating texture atlases) for as many objects using the same shader as possible.
Here is how you can use RenderDoc to look at the and input textures / uniforms for the currently selected draw call:
LOD (level of detail) systems allow you to swap out super complex detailed meshes when viewed up close to lower geometry meshes that approximate the same model when viewed from far away. The reason why you don’t see models popping when switching out with well-made LODs is because the amount of pixels covered by meshes decrease with distance, as long as the scale stays the same. Do you really want to render a 100k vert mesh of a statue 500 world units away when it only covers a handful of pixels? NO! Take full advantage of these built-in LOD systems that modern commercial game engines give you.
You can sanity check your LOD system to make sure objects that appear far away, near the far clip plane, are using the lowest LOD associated by investigating the vertex count in RenderDoc and matching that up with the object’s pixel coverage. You can then trace those issues back to your assets and make sure LOD’s use a proper camera distance to swap them out.
In the image below, I set up a simple scene that puts a geometrically dense stove model very far away from the camera:
HLOD (hierarchical level of detail systems) are an extension of LOD systems that help in saving draw calls for large clusters of far away objects. It’s best to explain with an example:
What if you are trying to render a large city and you have many blocks where each object is rendered with their own draw call? Even if the far away blocks contain the lowest LODs for each mesh that composes the full city block you will still bloat your draw call count by simply drawing each low LOD. With an HLOD system you can merge all of the lowest LOD meshes in far away blocks offline, then tell the engine to render the entire block chunk in one draw call. Keep in mind that if you have an uber environment shader that uses texture atlases or texture array lookups (as mentioned previously), all you need to do is have your meshes merged and UVs remapped. The downside to HLOD systems is that they are must utilize more memory as you are essentially duplicating geometry for all low LODs composing an HLOD, so be sure that you have ample memory available.
References: Geometry & LOD
There are a few different types of draw call / culling systems that engines use to assure games are rendering as efficiently as possible by skipping unnecessary draw calls. I’ll run through the big ones below.
Frustum culling is the practice of skipping draw calls that fall outside of the camera frustum. If an object is outside the frustum, it will not project to any pixels in screen space, therefore that draw call can be discarded to avoid the GPU cost of processing the geometry and the CPU cost of setting up the draw call for no reason. Developers can set a few parameters to determine frustum size (near/far clip plane + FOV) but on the Quest it’s best to leave them as the default settings given by OVRCameraRig. This is important to maintain comfort.
In RenderDoc you can use the Mesh Viewer tab to look at the input mesh and the generated bounding volume. You can even look at how the mesh was projected into screen space by selecting the VS output section.
Draw distance culling means that you will skip drawing objects that are a certain distance away from the camera. The engine will do this automatically if an object is further than the far clip plane mentioned above in the frustum culling section, but for certain objects it might make sense to avoid dispatching draws for smaller objects that are inside the camera frustum that cover just a handful of pixels. Skipping these visually insignificant draw calls will save you CPU time on your main / render thread, and avoid processing the geometry for those objects. If we go back to the stove example in the LOD section, we can do one better and just disable the mesh if it’s too far away to be significant to the context of the game:
This is a culling step that takes place within the camera frustum itself. The idea behind occlusion culling is to skip draw calls for meshes that are completely occluded behind another mesh. A common example for this case is if you’re standing outside of a cabin in the middle of the woods with no windows & the door closed. Do you want to render every single object inside? No! You can do some calculations on the CPU to determine which objects are occluded by others before you need to begin setting up your render thread. This will save you additional draw calls and geometry processing time on the GPU. I briefly mentioned occlusion culling back in the renderqueue section because they are somewhat related. The main difference between the two is z-rejection will still process the geometry and add a draw call for a mesh (saves occluded pixels from executing their fragment shader) whereas occlusion culling will save you the entire occluded mesh (draw call prep, geometry processing and all fragment shader operations).
In RenderDoc, you can see all rendered objects within a frame and use common sense to determine if your game would benefit from implementing an occlusion culling technique. In the screenshots below, I include the final completed render target,followed by a screenshot of RenderDoc stepping through draw calls, about 20% of the way into the frame. In the second screenshot you can clearly see many objects rendering inside of the cabin that didn’t need to be drawn.
What you see in your headset
What’s actually happening (wasted draws and rendering work)
The occlusion culling algorithm’s calculations on the CPU do require processing power to determine occluded meshes, but for some game types and situations it’s worth it. There are many techniques to choose from and both Unity + UE4 have built-in solutions that can be tested for timing and efficiency. Some work better than others for certain situations, so make sure you consider your scene layout / playstyle when making your choice on which implementation to use or create.
In Unity: be sure to leverage DOTS where you can. This is very parallelizable work that’s heavily math oriented so you’ll see tremendous gains.
For UE4: be sure to leverage parallel containers to benefit from multi-core usage.
To make sure each of the previously mentioned culling systems are as effective as possible, make sure that batches don’t have sparse areas of geometry and that the extents of geometry in the batch don’t get too big. If you’ve merged meshes that are geometrically dense in a concentrated area but you have a few long / thin meshes sticking out, it will make the bounding volume generated for the batch very big, leading to lots of false positives in the culling algorithms. This makes all geometry in the batch process zero to very little pixel coverage in many cases. Make sure that the bounding volume generated around mesh batches are as concentrated as possible. Think to yourself, if I had to contain this giant mesh in a glass box how much empty space would there be? Make sure the empty space is always as low as possible, and separate into discrete batches if necessary.
This is not an egregious example (not very much geometry), but it does show how to look at a submitted batch and potentially split them up. The below picture is of a single batch draw. A case could be made that it should be split in to batch 1 and batch 2 if the player is standing in the middle of that bounding volume (in worldspace) to save geometry projection cost. Ultimately, this would be much more important if there were very dense objects in those clusters of the bounding volume:
Sometimes when you’re using a transparent blend state on your material you need to completely fade out the object. Over the years I've found that sometimes developers will forget to disable their mesh renderer components once objects are completely faded out (alpha = 0). This can have cost implications that range from mild (fading out a small consumable object) to extreme (fading in and then out a full screen vignette effect). If you leave a mesh renderer component enabled after completely fading out an object, you might think the engine will take care of skipping the draw call automatically behind the scenes, but this isn’t the case most of the time because they don’t know your intent. What this means is that your draw call will go all the way through the graphics pipeline without updating any pixels, which is a complete waste of a draw call on the CPU and all of that rendering work on the GPU. Some pixel shaders will early-out on a conditional check if alpha = 0 but you’re still wasting precious GPU time processing geometry.
TL;DR look for draw calls in your frame that don’t contribute to the final color buffer and trace those draws back to the engine. This makes a simple conditional check to disable the associated renderer if alpha = 0 in script / blueprint code.
The screenshot below presents a stretched cube with a transparent material. I turn the alpha value down via the diffuse texture alpha channel to fade out the object.
In RenderDoc we can see that the fully transparent object is still rendering but not outputting any color change due to the alpha value being 0.
Within Unity and Unreal be sure to disable the associated renderer for objects that turn fully transparent.
If you find a rendering artifact, you can locate the draw call causing the issue and help narrow your search. An additional benefit of RenderDoc .rdc files is that you can re-run them on any Oculus Quest that shares the same gfx driver version as the device that it was captured on. This is very helpful if you want QA to hunt down performance hotspots and take captures to pass off to those responsible for graphics and performance.
I hope this guide starts you on a grand mystical adventure of smoothing out performance in your games and applications. RenderDoc is one of the most powerful tools in your utility belt for determining what’s actually going on in the GPU and making sure that you’re draw call preparation / render thread on the CPU are as thin as possible. Remember, even if you’re running under budget you can still optimize to make room for additional effects and graphical beauty to bring much more immersion to your players.
Give RenderDoc a download and try it out for yourself! You’ve got nothing to lose except some of that sweet, sweet performance that might be otherwise left on the table.
- Cristiano Ferreira
If you haven’t already, be sure to check out the other articles in this series on RenderDoc: