Musings on cross-platform graphics engine architectures – Part 2

This is part 2 of a series on graphics engine architecture. You can read part 1 here.

Part 2 – Multi-threaded command recording and submission

When we take a closer look at any contemporary PC graphics API, we can identify two major components: One to allocate and manage graphics memory, generally in the form of textures, buffers, heaps, views/descriptors, etc. and one to issue commands to your GPU through some type of abstracted command list or device context API. These are your engine’s primary tools to communicate with your graphics driver, which in turn relays the information you give it to your graphics hardware.

A breakdown of the two major graphics API components

In any graphics application you write you’ll want to make sure that these components are used correctly and efficiently. The general idea is that you want to interact with them as little as you can get away with, as each call into your device driver has a certain CPU-side cost associated with it. Redundant API calls can also have a negative impact on GPU performance, as various state changes can have a hidden overhead associated with them (see AMD’s excellent piece on demystifying context rolls as an example).

For a single hardware platform and a single graphics API this can get complicated already. In the era of console-like graphics development on PC where we need to take on various responsibilities the driver used to take care of, it’s hard to keep track of what minimal and efficient API usage actually means. There’s usually no single correct way of doing things, so it’s on you to experiment and figure out what works for your application and what you want to expose to the users of your library. Throw in a couple of additional platforms and graphics APIs to support, and you’ll quickly start introducing inefficient usage of your graphics APIs.

I’d like to walk you through a model of resource management and command submission where each platform has the freedom to make the right decisions for itself, regardless of what another platform might need. In addition to that I want to guarantee proper multi-threading support in this model so we can efficiently scale up our command submission code to as many threads as we’d like. Let’s start off with a look at a flexible resource management setup.

GPU resource creation and management

Let me first explain what I mean when I say “GPU resource”: A resource is any object created by your underlying graphics API which can be consumed, manipulated or inspected by some form of command submission API (e.g. a command list or device context). This definition spans a pretty wide range of items, from memory heaps, buffers and textures to descriptor tables, pipeline state objects and root signatures (or their equivalents in your API of choice).

Before we talk about submitting commands, we need to explain how we want to represent these resources when using them in our commands. For my purposes that representation comes in the form of opaque handle types.

Handles are essentially strongly typed integers. You can implement these in various ways: You could wrap an integer type into a struct, or define an enum class with a sized integer backing. The important thing here is that they’re strongly typed; you don’t want to be able to assign a handle representing a descriptor table to a handle representing a 2D texture, for example.

// An example of a strongly typed 32-bit handle using an enum class
// Note: A typedef or using statement won't work here as this won't provide strong typing
enum class Tex2DHandle : uint32 { Invalid = 0xFFFFFFFF };

// The same handle type using the struct approach
struct Tex2DHandle { uint32 m_value; };

A handle is going to become a unique representation for a given resource. The handle itself is entirely opaque and won’t be conveying any direct information about what resource it represents (as far as your application is concerned at least). This gives the underlying system the freedom to represent and organize resources in whatever way it wants; all it needs to do is guarantee that there’s a one-to-one mapping of an encoded handle to whatever backing representation it holds on to.
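To make the strong typing point a bit more concrete, here's a minimal sketch of the struct approach generalized with a tag type. The names Handle, Tex2DTag and DescriptorTableTag are made up for illustration, and I'm using std::uint32_t in place of the uint32 typedef from the snippets above. Note that the using statements here alias distinct template instantiations rather than a raw integer, so strong typing is preserved.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical generic handle. The Tag parameter carries no data; it only
// exists to make every instantiation a distinct, non-interchangeable type.
template <typename Tag>
struct Handle
{
    static constexpr std::uint32_t kInvalid = 0xFFFFFFFFu;
    std::uint32_t m_value = kInvalid;

    constexpr bool IsValid() const { return m_value != kInvalid; }

    friend constexpr bool operator==(Handle a, Handle b) { return a.m_value == b.m_value; }
    friend constexpr bool operator!=(Handle a, Handle b) { return !(a == b); }
};

// Empty tag types, one per resource kind (illustrative names)
struct Tex2DTag {};
struct DescriptorTableTag {};

using Tex2DHandle = Handle<Tex2DTag>;
using DescriptorTableHandle = Handle<DescriptorTableTag>;

// The compiler now rejects mixing up resource kinds
static_assert(!std::is_convertible<DescriptorTableHandle, Tex2DHandle>::value,
              "a descriptor table handle must not convert to a texture handle");
static_assert(!std::is_assignable<Tex2DHandle&, DescriptorTableHandle>::value,
              "a descriptor table handle must not be assignable to a texture handle");

// The wrapper costs nothing over a plain 32-bit integer
static_assert(sizeof(Tex2DHandle) == sizeof(std::uint32_t), "handle should stay 32 bits");
```

Whether you go with a tagged struct like this or an enum class is mostly a matter of taste; both give you a compile error when handle types get mixed up.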

There are some pretty awesome benefits to using handles to represent resources. I called out one of them above: giving your engine control over how resources are represented and laid out. Another one is that you avoid the mess of trying to design a common interface to represent a resource across various platforms. You’ll sometimes see a virtual interface for resources, or shared classes which provide different per-platform implementations using preprocessor defines. These types of constructs are prone to becoming messy over time, especially when maintained by various people over various platforms. Handles don’t push you towards solutions like this, instead giving the platform the option to represent a resource as it sees fit under the hood.

Another cool aspect to handles is that it makes it easier to represent resources as pure data. There’s no API surrounding a resource. There are no methods to call which will resize a texture, or generate mips, or do some other random operation, nor will you encounter the temptation to add more bloated and unneeded operations. There are no painful situations where graphics APIs are completely incompatible (e.g. descriptor tables vs. resource views). My preference is to not even provide a way to query any type of resource properties (e.g. texture width, height, mip count, etc), with the idea being that if your application can request a resource with a given set of properties at one point during its lifetime, it should be able to store those properties somewhere in a form that fits the application if they are to be of interest later on. It’s about having a very clearly defined use case for your data, and defining very clear minimal responsibilities for your engine around that data. A resource handle can either be created, destroyed, or used in a command submission API. That’s all there is to it.

One last property of handles I explicitly want to call out is that they can remove the need for passing in pointers or references to resources into your engine, drastically improving your debugging story. A 32-bit handle has more than enough space to encode some type of implementation-specific validation information regarding what the handle is supposed to represent. This means safe access to resources at all times, and sane error reporting in case an invalid handle is passed into your engine layer.
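One possible way to encode validation information is to split the 32 bits into a slot index and a generation counter. The sketch below is just an illustration under that assumption: the 20/12 bit split, the Tex2DPool class and the Tex2DDesc stand-in are all hypothetical, and a real engine layer would store native API objects instead of a description struct.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Tex2DHandle : std::uint32_t { Invalid = 0xFFFFFFFFu };

// Stand-in for whatever the platform really stores per texture
struct Tex2DDesc { std::uint32_t width; std::uint32_t height; };

// Hypothetical pool. Assumed layout: low 20 bits = slot index, high 12 bits = generation.
class Tex2DPool
{
public:
    Tex2DHandle Create(const Tex2DDesc& desc)
    {
        std::uint32_t index;
        if (!m_freeList.empty())
        {
            index = m_freeList.back();
            m_freeList.pop_back();
        }
        else
        {
            index = static_cast<std::uint32_t>(m_slots.size());
            // Keep the last index unused so no live handle can equal Invalid
            assert(index < kIndexMask && "pool is full");
            m_slots.push_back(Slot{});
        }
        Slot& slot = m_slots[index];
        slot.desc = desc;
        slot.alive = true;
        return static_cast<Tex2DHandle>((slot.generation << kIndexBits) | index);
    }

    // Returns false (instead of crashing) when handed a bad handle
    bool Destroy(Tex2DHandle handle)
    {
        Slot* slot = Resolve(handle);
        if (!slot) return false;
        slot->alive = false;
        // Bumping the generation invalidates every outstanding copy of the handle.
        // Note: the counter wraps after 4096 reuses of the same slot.
        slot->generation = (slot->generation + 1u) & kGenerationMask;
        m_freeList.push_back(static_cast<std::uint32_t>(handle) & kIndexMask);
        return true;
    }

    // Returns nullptr for invalid, stale, destroyed or out-of-range handles
    const Tex2DDesc* TryGet(Tex2DHandle handle)
    {
        Slot* slot = Resolve(handle);
        return slot ? &slot->desc : nullptr;
    }

private:
    struct Slot { Tex2DDesc desc{}; std::uint32_t generation = 0; bool alive = false; };

    static constexpr std::uint32_t kIndexBits = 20;
    static constexpr std::uint32_t kIndexMask = (1u << kIndexBits) - 1u;
    static constexpr std::uint32_t kGenerationMask = (1u << (32u - kIndexBits)) - 1u;

    Slot* Resolve(Tex2DHandle handle)
    {
        if (handle == Tex2DHandle::Invalid) return nullptr;
        const std::uint32_t raw = static_cast<std::uint32_t>(handle);
        const std::uint32_t index = raw & kIndexMask;
        if (index >= m_slots.size()) return nullptr;
        Slot& slot = m_slots[index];
        if (!slot.alive || slot.generation != (raw >> kIndexBits)) return nullptr;
        return &slot;
    }

    std::vector<Slot> m_slots;
    std::vector<std::uint32_t> m_freeList;
};
```

With something like this in place, using a handle after its resource was destroyed becomes a detectable error rather than a dangling pointer, even when the slot has since been reused for another texture.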

Now that we’ve talked about handles, we can have a look at how we’ll be using them when we’re submitting work to the GPU.

Command submission: Wrangling state

Let’s clarify what we want to get out of our command submission system. We mentioned both scalability across multiple threads and minimal interaction with the underlying API already, but there are a few other things I want to achieve. One of them is a way to avoid state leakage.

Native command submission APIs (e.g. ID3D11DeviceContext or ID3D12GraphicsCommandList) are in a sense stateful. You plug pieces of state into them, or flip a handful of switches before issuing an operation such as a draw call or a compute dispatch. Any operation from setting a vertex buffer to binding a pipeline state effectively changes state around in your command API. There’s nothing inherently wrong with this, but it isn’t uncommon for some side-effects to occur because of this stored state. One of them is something we call state leakage, and this can become quite harmful the more your codebase grows.

Take the following trivial pseudocode as an example:

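void RenderSomeEffect()
{
  // Set up everything required to make your draw call
  SetBlendState(...);
  // Bind a texture to slot 1
  SetTexture(1, ...);
  Draw(...);
}

void RenderSomeOtherEffect(EffectOption options)
{
  // Draw setup (no blend state is set here)
  Draw(...);
}

void RenderAllTheThings()
{
  RenderSomeEffect();
  RenderSomeOtherEffect(...);
}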

The blend state we set in RenderSomeEffect will leak into the draw we make in RenderSomeOtherEffect, as these calls happen one after another, and RenderSomeOtherEffect does not specify a blend state of its own. In some codebases this can be desired behavior, but reliance on state leakage can often cause odd bugs when state being leaked from a system gets removed. Anything rendered in RenderSomeOtherEffect might start to rely on the blend state set in RenderSomeEffect, which could cause some very annoying bugs when RenderSomeEffect is changed or moved around.

Disallowing state leakage luckily isn’t all too difficult. The first step we need to take is to build an application-facing API in which state lifetime or state scope is well defined. The purpose of a scope is to clearly define the boundaries of when a piece of state is set, and when that piece of state gets invalidated again, similar to the concept of RAII in C++. You can define a state scope at various levels of granularity, but for this example we’ll look at two levels of state scoping: a render pass scope and a draw/dispatch packet scope.

A render pass scope lasts throughout the render pass that’s currently being drawn and defines only those pieces of state that are required by all draws or dispatches being executed in that pass. This might include a set of render targets, a depth buffer, a pass-specific root signature and any per-pass or per-view resources (textures, constant buffers, etc.).

// Example render pass data. Implementation of this is up to you!
struct RenderPassData
{
  RootSignature m_passRootSignature;
  RenderTargetHandle m_renderTargets[8];
  DepthTargetHandle m_depthTarget;
  ShaderResourceHandle m_shaderResources[16];
};

// You could implement a scope as an RAII structure. 
// Don't worry about the CommandBuffer argument, we'll get to that in a bit!
RenderPassScope BeginRenderPassScope(CommandBuffer& cbuffer, const RenderPassData& passData);

// You could also just provide simple begin/end functions
void BeginRenderPassScope(CommandBuffer& cbuffer, const RenderPassData& passData);
void EndRenderPassScope(CommandBuffer& cbuffer);

A packet scope is a scope that lasts for just a single draw or dispatch operation and is essentially a fully defined description of all resources needed to execute a draw or a dispatch (outside of what was set in the render pass scope). For draw packets this could include vertex and index buffers, a graphics pipeline state, a primitive topology, all per-draw resource bindings and the type of draw you want to execute with all parameters for that draw. A compute packet is simpler in that it just defines a compute pipeline state, a set of per-dispatch resource bindings and the type of dispatch (regular or indirect) with accompanying parameters. You could choose whether draw and compute packets are allowed to temporarily override any resources set in the render pass scope or not.

// Example of what a render packet could look like. Again, this is up to you!
struct RenderPacket
{
  PipelineState m_pipelineState;
  VertexBufferView m_vertexBuffers[8];
  IndexBufferView m_indexBuffer;
  ShaderResourceHandle m_shaderResources[16];
  PrimitiveTopology m_topology;
};

// Example draw operation using a render packet
void DrawIndexed(CommandBuffer& cbuffer, const RenderPacket& packet);

With this setup you have a guarantee that state can’t leak out of any draw or render pass. A new render pass scope can’t begin until the last one has ended (you can easily enforce this), and ending a scope means that all state defined by that scope is invalidated. Application interaction with your engine layer is now relatively safe; it’s up to you now to guarantee safety and optimal API usage in your engine internals.
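As a rough illustration of the "you can easily enforce this" part, here's a sketch of an RAII scope that asserts when a second pass begins before the first one has ended. The CommandBuffer and RenderPassData types below are empty stand-ins I made up so the example compiles on its own; the bookkeeping members only exist to make the behavior observable.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins so the sketch is self-contained
struct RenderPassData { int m_placeholder = 0; };

struct CommandBuffer
{
    bool m_passActive = false;  // is a render pass scope currently open?
    int m_passesBegun = 0;
    int m_passesEnded = 0;
};

class RenderPassScope
{
public:
    RenderPassScope(CommandBuffer& cb, const RenderPassData& passData) : m_cb(&cb)
    {
        // Enforce the rule: no new pass until the previous one has ended
        assert(!cb.m_passActive && "a render pass scope is already active");
        cb.m_passActive = true;
        ++cb.m_passesBegun;
        (void)passData;  // a real implementation would record a "begin pass" command here
    }

    ~RenderPassScope() { End(); }

    // Movable so it can be returned from BeginRenderPassScope, but never copied
    RenderPassScope(RenderPassScope&& other) noexcept
        : m_cb(std::exchange(other.m_cb, nullptr)) {}
    RenderPassScope(const RenderPassScope&) = delete;
    RenderPassScope& operator=(const RenderPassScope&) = delete;
    RenderPassScope& operator=(RenderPassScope&&) = delete;

private:
    void End()
    {
        if (m_cb)
        {
            // Ending the scope invalidates all state the pass defined
            m_cb->m_passActive = false;
            ++m_cb->m_passesEnded;
            m_cb = nullptr;
        }
    }

    CommandBuffer* m_cb;
};

RenderPassScope BeginRenderPassScope(CommandBuffer& cb, const RenderPassData& passData)
{
    return RenderPassScope(cb, passData);
}
```

The assert is only one option; you could just as well log an error and drop the offending pass, depending on how forgiving you want your engine layer to be.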

Here’s a pseudo-code example of what working with scopes could look like:

void RenderSomeRenderPass(CommandBuffer& cb, const array_view<RenderableObject>& objects)
{
  // Begin with a render pass scope.
  // This will bind all state provided by GetRenderPassData
  // Because we're using RAII this scope will end at the end of this function body
  RenderPassScope passScope = BeginRenderPassScope(cb, GetRenderPassData());

  // Build render packets for our objects and submit them
  for (const RenderableObject& obj : objects)
  {
    RenderPacket packet = BuildRenderPacketForObject(obj);
    DrawIndexed(cb, packet);
  }
}


Command submission: Recording and execution

So far we’ve discussed how we interact with resources and how we interact with state. To tie it all together, let’s talk about recording, submitting and executing commands.

It’s tempting to start off writing a typical abstraction layer around the command list concept. You could create a command list class with an API to record GPU operations such as submitting packets or doing memory copy operations. As hinted at in one of the pseudocode snippets above, I’d like to approach things a little differently. Ideally I’d like to have my application graphics code decoupled from direct interaction with an API like D3D or Vulkan. To achieve this we can introduce the concept of command buffers.

A command buffer is a chunk of memory which we write a series of commands into. A command is a combination of a header or opcode followed by the parameters for that command (e.g. a “draw render packet command” would have a “draw render packet” opcode followed by a full render packet description). We’re essentially writing a high level program which we send off to the engine layer to interpret. This turns the engine layer into a server which processes a full sequence of commands in one go, rather than accepting commands one by one.
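Here's a minimal sketch of what "an opcode followed by its parameters" could look like in memory. Everything in it is an assumption for illustration: the byte-vector backing store, the Opcode values, and the tiny DrawCommand/DispatchCommand structs (a real draw command would carry a full render packet instead).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical opcodes; a real engine would have many more
enum class Opcode : std::uint32_t { Draw, Dispatch };

// Deliberately tiny parameter blocks, standing in for full packets
struct DrawCommand     { std::uint32_t m_vertexCount; std::uint32_t m_firstVertex; };
struct DispatchCommand { std::uint32_t m_x; std::uint32_t m_y; std::uint32_t m_z; };

// The command buffer is nothing more than a chunk of bytes
struct CommandBuffer { std::vector<std::uint8_t> m_bytes; };

// Appends [opcode][params] to the end of the buffer
template <typename T>
void PushCommand(CommandBuffer& cb, Opcode opcode, const T& params)
{
    // Parameters are copied byte-for-byte, so they must be plain data
    // (handles and values, no pointers to objects with lifetimes)
    static_assert(std::is_trivially_copyable<T>::value, "commands must be plain data");

    const std::size_t offset = cb.m_bytes.size();
    cb.m_bytes.resize(offset + sizeof(Opcode) + sizeof(T));
    std::memcpy(cb.m_bytes.data() + offset, &opcode, sizeof(Opcode));
    std::memcpy(cb.m_bytes.data() + offset + sizeof(Opcode), &params, sizeof(T));
}

// The application-facing API is just a set of free functions
void Draw(CommandBuffer& cb, std::uint32_t vertexCount, std::uint32_t firstVertex)
{
    PushCommand(cb, Opcode::Draw, DrawCommand{vertexCount, firstVertex});
}

void Dispatch(CommandBuffer& cb, std::uint32_t x, std::uint32_t y, std::uint32_t z)
{
    PushCommand(cb, Opcode::Dispatch, DispatchCommand{x, y, z});
}
```

Copying through memcpy sidesteps alignment concerns at the cost of a copy when reading commands back; a production layout might pad commands so they can be read in place.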

Building an API around this command buffer concept is very simple, as you’re at the point where you’re almost directly implementing the “video player” concept I talked about in part 1. A command buffer API can be a simple set of free functions which push commands onto your buffer. If you want to support a new platform which can do some type of exotic operation, it could be absolutely fine to introduce a new set of functions adding support for those commands only on that platform. We’re not using any interfaces, no bloated command list classes, no pImpl or other over-complicated C++ nonsense. Just a plain extensible C-like API will suffice.

void Draw(CommandBuffer& cb, const RenderPacket& packet);
void DrawIndexed(CommandBuffer& cb, const RenderPacket& packet);
void DrawIndirect(CommandBuffer& cb, const RenderPacket& packet, BufferHandle argsBuffer);

void Dispatch(CommandBuffer& cb, uint32 x, uint32 y, uint32 z);

void CopyResource(CommandBuffer& cb, BufferHandle src, BufferHandle dest);

#if defined (SOME_PLATFORM)
void ExoticOperationOnlySupportedOnSomePlatform(CommandBuffer& cb);
#endif


When it comes to the engine layer implementation of command buffers, you get complete control over the translation of your command buffers to native API calls. You can re-order commands, sort them, add new ones or even straight up ignore some of them if that’s appropriate to do for your particular platform (some operations might not be supported on your platform, but could be fine to ignore!). Handles can be interpreted as your engine layer sees fit (remember: no naked pointers in command buffers!). When implementing this, always keep in mind what your command buffer layout, your parsing logic and your resource handle resolve logic will do in terms of memory access. Give cache-friendly memory access as high a priority as efficient graphics API usage, because bad memory access patterns will kill performance in a system like this.
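The engine-side translation can be sketched as a simple interpreter loop that walks the buffer front to back. This is an illustration only: the opcodes, the command structs and the NativeContext type (which just counts calls where a real backend would issue native API calls) are all made up for the example, and the "exotic" command is one this imaginary platform chooses to ignore.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum class Opcode : std::uint32_t { Draw, Dispatch, ExoticOperation };

struct DrawCommand     { std::uint32_t m_vertexCount; };
struct DispatchCommand { std::uint32_t m_x; std::uint32_t m_y; std::uint32_t m_z; };

struct CommandBuffer { std::vector<std::uint8_t> m_bytes; };

// Hypothetical stand-in for a native context or command list
struct NativeContext
{
    std::uint32_t m_drawCalls = 0;
    std::uint32_t m_verticesDrawn = 0;
    std::uint32_t m_dispatchCalls = 0;
    std::uint32_t m_ignoredCommands = 0;
};

// Appends a plain-data value to the buffer (recording side)
template <typename T>
void Write(CommandBuffer& cb, const T& value)
{
    const std::size_t offset = cb.m_bytes.size();
    cb.m_bytes.resize(offset + sizeof(T));
    std::memcpy(cb.m_bytes.data() + offset, &value, sizeof(T));
}

// Reads a value at the cursor and advances it; fails on truncated input
template <typename T>
bool Read(const CommandBuffer& cb, std::size_t& cursor, T& out)
{
    if (cb.m_bytes.size() - cursor < sizeof(T)) return false;
    std::memcpy(&out, cb.m_bytes.data() + cursor, sizeof(T));
    cursor += sizeof(T);
    return true;
}

// Translates a whole command buffer in one linear pass.
// Returns false if the buffer is malformed.
bool Execute(const CommandBuffer& cb, NativeContext& ctx)
{
    std::size_t cursor = 0;
    while (cursor < cb.m_bytes.size())
    {
        Opcode opcode;
        if (!Read(cb, cursor, opcode)) return false;

        switch (opcode)
        {
        case Opcode::Draw:
        {
            DrawCommand cmd;
            if (!Read(cb, cursor, cmd)) return false;
            ++ctx.m_drawCalls;  // a real backend would resolve handles and draw here
            ctx.m_verticesDrawn += cmd.m_vertexCount;
            break;
        }
        case Opcode::Dispatch:
        {
            DispatchCommand cmd;
            if (!Read(cb, cursor, cmd)) return false;
            ++ctx.m_dispatchCalls;
            break;
        }
        case Opcode::ExoticOperation:
            // Carries no parameters; not supported on this "platform", safe to skip
            ++ctx.m_ignoredCommands;
            break;
        default:
            return false;  // unknown opcode
        }
    }
    return true;
}
```

Because the loop only ever moves forward through one contiguous block of memory, the access pattern stays predictable; the handle resolve step is where you'd need to be most careful about scattered reads.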

Command submission: Going wide

Because we’re using the concept of isolated command buffers, multi-threaded command recording becomes easy (if you’re accessing any sort of shared or global state when recording a command buffer, you’re doing it wrong!). A single command buffer might not be thread-safe, but you should never find yourself in a situation where you’d want to share a command buffer between threads. Command buffers should be small and cheap, so go wide with them!

In terms of multi-threading there’s one missing piece of the puzzle though, which is multi-threaded recording and submission of native graphics API command lists (if those are available to you). To achieve this I use a model I like to call the record-stage-commit model.

The record-stage-commit model

The recording aspect is what we discussed initially: recording commands into command buffer objects using our C-like command API. The staging aspect consists of two parts: The first one is the actual translation of our command buffer into a native command list. With an API like D3D12, this would mean building an ID3D12GraphicsCommandList object. The second aspect of staging is queuing your command list for execution, together with a sort key determining when your command list should get executed relative to other command lists in your commit. It’s important to note that you should keep the staging aspect entirely free-threaded, as you want to do this on many threads at once. You could achieve this by using thread-local storage, or some form of thread-safe queue to form a list of command list and sort key pairs.

Your commit now becomes a bundled execution of all staged command lists once all recording and staging jobs are completed. The commit operation will take all queued command lists, sort them according to their sort key, and then submit them using an ExecuteCommandLists-like API. This model now gives you an API in which you can construct large graphs of rendering jobs on any level of granularity you want. You could simultaneously record all of your render passes over as many threads as you like, while guaranteeing ordering on final submission.
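To give a feel for the stage and commit halves, here's a small sketch using a mutex-guarded list as the "some form of thread-safe queue" option. All names are hypothetical: NativeCommandList is a stand-in for something like a native command list object, and RenderFrame simply spins up one thread per pass where a real engine would use its job system.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a translated native command list
struct NativeCommandList { std::string m_debugName; };

struct StagedList
{
    std::uint64_t m_sortKey;
    NativeCommandList m_list;
};

class CommitQueue
{
public:
    // Stage: may be called from any number of threads at once
    void Stage(std::uint64_t sortKey, NativeCommandList list)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_staged.push_back(StagedList{sortKey, std::move(list)});
    }

    // Commit: call once all recording and staging jobs have finished.
    // Returns the lists in submission order; a real engine would hand them
    // to an ExecuteCommandLists-like call instead.
    std::vector<NativeCommandList> Commit()
    {
        std::vector<StagedList> staged;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            staged.swap(m_staged);
        }
        std::stable_sort(staged.begin(), staged.end(),
                         [](const StagedList& a, const StagedList& b)
                         { return a.m_sortKey < b.m_sortKey; });

        std::vector<NativeCommandList> ordered;
        ordered.reserve(staged.size());
        for (StagedList& entry : staged)
            ordered.push_back(std::move(entry.m_list));
        return ordered;
    }

private:
    std::mutex m_mutex;
    std::vector<StagedList> m_staged;
};

// Records and stages every pass on its own thread, then commits them in key order
std::vector<NativeCommandList> RenderFrame(
    const std::vector<std::pair<std::uint64_t, std::string>>& passes)
{
    CommitQueue queue;
    std::vector<std::thread> workers;
    for (const auto& pass : passes)
    {
        workers.emplace_back([&queue, &pass]
        {
            // Recording and translation to a native list would happen here
            queue.Stage(pass.first, NativeCommandList{pass.second});
        });
    }
    for (std::thread& worker : workers)
        worker.join();
    return queue.Commit();
}
```

The order in which threads finish staging doesn't matter; only the sort keys decide the final submission order. If lock contention ever shows up in a profile, per-thread lists merged at commit time would be the thread-local storage variant of the same idea.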

If you’re working with D3D11 or OpenGL, and you can’t easily build native command lists over many threads, you can still do multi-threaded recording of command buffers. Your staging step will just stage raw command buffers, and your commit step will be the one actually parsing and translating these buffers into native API calls. It’s not as ideal as the multi-threaded command recording case in D3D12 or Vulkan, but it at least gives you some form of scalability!

Addendum: Cool things to do with command buffers

Just for fun, here’s a list of other neat things you can do with command buffers:

  • Save/load them to/from disk
    • Build graphics code in your content pipeline! Write custom graphics capture tools!
    • Dump out the last staged command buffers on a graphics-related crash for debugging
  • Use them as software command bundles to optimize rendering
  • Send them over the network to remotely diagnose graphics issues
  • Build them in a compute shader (not sure what this would gain you, but you can do it!)
  • And so much more!

There’s a lot of potential in these things for tools, debugging and optimization purposes. Go nuts!

Closing up

Whew, that was a long one! That’s all I had to share for this entry. Feel free to leave a comment here or ping me on Twitter @BelgianRenderer. I’m sure there will be plenty of opinions on some of the concepts I’ve discussed here, and I’d be glad to hear them and discuss them with you. Because this post got a bit larger than I had anticipated I didn’t go super in depth into some concepts, so please feel free to ask if you have questions or concerns. I prefer explaining concepts at a higher level rather than talking about nitty-gritty implementation details. If you do want a more detailed overview of a specific section of this post, just send me a message!

Thank you for reading! See you in Part 3, where we’ll be talking about efficiently working with native API concepts!
