Secrets of Direct3D 12: Copies to the Same Buffer

Wed 04 Mar 2020

Modern graphics APIs (D3D12, Vulkan) are complicated. They are designed to squeeze maximum performance out of graphics cards. GPUs are so fast at rendering not because they run at high clock frequencies (actually they don't - a frequency of 1.5 GHz is high for a GPU, while CPUs run at several GHz), but because they execute their workloads in a highly parallel and pipelined way. In other words: many tasks may be executed at the same time. To make this work correctly, we must manually synchronize them using barriers. At least sometimes...

Let's consider a few scenarios. Scenario 1: A draw call rendering to a texture as a Render Target View (RTV), followed by a draw call sampling from this texture as a Shader Resource View (SRV). We know we must put a D3D12_RESOURCE_BARRIER_TYPE_TRANSITION barrier in between them to transition the texture from D3D12_RESOURCE_STATE_RENDER_TARGET to D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE.

Scenario 2: Two subsequent compute shader dispatches, executed in one command list, access the same texture as an Unordered Access View (UAV). The texture stays in D3D12_RESOURCE_STATE_UNORDERED_ACCESS, but if the second dispatch needs to wait for the first one to finish, we still must issue a barrier of the special type D3D12_RESOURCE_BARRIER_TYPE_UAV. That's what this type of barrier was created for.

Scenario 3: Two subsequent draw calls rendering to the same texture as a Render Target View (RTV). The texture stays in the same state D3D12_RESOURCE_STATE_RENDER_TARGET. We need not put a barrier between them. The draw calls are free to overlap in time, but the GPU has its own ways to guarantee that multiple writes to the same pixel will always happen in the order of the draw calls, and even more - in the order of primitives as given in the index and vertex buffers!

Now to scenario 4, the most interesting one: Two subsequent copies to the same resource. Let's say we work with buffers here, just for simplicity, but I suspect textures work the same way. What if the copies affect the same or overlapping regions of the destination buffer? Do they always execute in order, or can they overlap in time? Do we need to synchronize them to get the proper result? What if some copies are fast, made from another buffer in GPU memory (D3D12_HEAP_TYPE_DEFAULT), and some are slow, accessing system memory (D3D12_HEAP_TYPE_UPLOAD) through the PCI-Express bus? What if the card uses a compute shader to perform the copy? Isn't this the same as scenario 2?
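To make "overlapping regions" concrete, here is a minimal sketch of the check for whether two copy destinations touch the same bytes. ByteRange and Overlaps are hypothetical helper names introduced only for illustration; they are not part of Direct3D 12. Each range is treated as half-open, [offset, offset + size).

```cpp
#include <cstdint>

// Hypothetical helper type: a byte range inside a buffer, [offset, offset + size).
struct ByteRange {
    std::uint64_t offset;
    std::uint64_t size;
};

// Returns true when the two ranges share at least one byte.
// Empty ranges never overlap anything.
constexpr bool Overlaps(ByteRange a, ByteRange b) {
    return a.size > 0 && b.size > 0 &&
           a.offset < b.offset + b.size &&
           b.offset < a.offset + a.size;
}
```

Two ranges that merely touch, such as one ending at byte 10 and another starting at byte 10, do not count as overlapping under this definition.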

That's a puzzle my colleague posed recently. I didn't know the immediate answer to it, so I wrote a simple program to test this case. I prepared two buffers: gpuBuffer placed in the DEFAULT heap and cpuBuffer placed in the UPLOAD heap, 120 MB each, both filled with some distinct data and both transitioned to D3D12_RESOURCE_STATE_COPY_SOURCE. I then created another buffer, destBuffer, to be the destination of my copies. During the test I executed a few CopyBufferRegion calls, from one source buffer or the other, copying a small or a large number of bytes. I then read back destBuffer and checked if the result is valid.

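// Large copy from GPU memory: 40 MiB written at destination offset 50 MiB.
g_CommandList->CopyBufferRegion(destBuffer, 5 * (10 * 1024 * 1024),
    gpuBuffer, 5 * (10 * 1024 * 1024), 4 * (10 * 1024 * 1024));
// Large copy from system memory: 40 MiB at offset 30 MiB, overlapping the previous one.
g_CommandList->CopyBufferRegion(destBuffer, 3 * (10 * 1024 * 1024),
    cpuBuffer, 3 * (10 * 1024 * 1024), 4 * (10 * 1024 * 1024));
// Two small 4-byte copies to exactly the same destination offset.
g_CommandList->CopyBufferRegion(destBuffer, SPECIAL_OFFSET,
    gpuBuffer, 102714720, 4);
g_CommandList->CopyBufferRegion(destBuffer, SPECIAL_OFFSET,
    cpuBuffer, 102714720, 4);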

It turned out the result is valid! I checked it on both an AMD card (Radeon RX 5700 XT) and an NVIDIA card (GeForce GTX 1070). The driver serializes such copies, making sure they execute in order and the destination data is as expected even when memory regions written by the copy operations overlap.
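One way to check the read-back data is to replay the same copies on the CPU strictly in submission order and compare. The following is only a sketch of that idea, not the actual test program: CopyOp, BuildExpected and MatchesExpected are hypothetical names, plain std::vector buffers stand in for the D3D12 resources, and the destination is assumed to start zero-filled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical CPU-side mirror of one CopyBufferRegion call.
struct CopyOp {
    std::size_t dstOffset;
    const std::vector<std::uint8_t>* src;
    std::size_t srcOffset;
    std::size_t numBytes;
};

// Replays the copies one after another, so a later copy overwrites an
// earlier one wherever their destination ranges overlap.
// Assumption: the destination starts zero-filled.
std::vector<std::uint8_t> BuildExpected(std::size_t destSize,
                                        const std::vector<CopyOp>& ops) {
    std::vector<std::uint8_t> expected(destSize, 0);
    for (const CopyOp& op : ops) {
        assert(op.dstOffset + op.numBytes <= expected.size());
        assert(op.srcOffset + op.numBytes <= op.src->size());
        std::memcpy(expected.data() + op.dstOffset,
                    op.src->data() + op.srcOffset, op.numBytes);
    }
    return expected;
}

// Compares the bytes read back from the GPU with the CPU reference.
bool MatchesExpected(const std::vector<std::uint8_t>& readback,
                     const std::vector<std::uint8_t>& expected) {
    return readback == expected;
}
```

If the GPU executed overlapping copies out of order, the bytes in the overlapping region would come from the earlier source and the comparison would fail.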

I also made a capture using Radeon GPU Profiler (RGP) and looked at the graph. The copies are executed as a compute shader and large ones are split into multiple events, but after each copy there is an implicit barrier inserted by the driver, described as:

CmdBarrierBlitSync()
The AMD driver issued a barrier in between back-to-back blit operations to the same destination resource.

I think it explains everything. If the driver had to insert such a barrier, we can suspect it is required. I just can't find anything in the Direct3D documentation that would explicitly specify this behavior. If you find it, please let me know - e-mail me or leave a comment under this post.

Maybe we could insert a barrier manually in between these copies, just to make sure? Nope, there is no way to do it. I tried two different ways:

1. A UAV barrier like this:

D3D12_RESOURCE_BARRIER uavBarrier = {};
uavBarrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
uavBarrier.UAV.pResource = destBuffer;
g_CommandList->ResourceBarrier(1, &uavBarrier);

It triggers a D3D Debug Layer error that complains about the buffer not having UAV among its flags:

D3D12 ERROR: ID3D12GraphicsCommandList::ResourceBarrier: Missing resource bind flags. [ RESOURCE_MANIPULATION ERROR #523: RESOURCE_BARRIER_MISSING_BIND_FLAGS]

2. A transition barrier from COPY_DEST to COPY_DEST:

D3D12_RESOURCE_BARRIER transitionBarrier = {};
transitionBarrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
transitionBarrier.Transition.pResource = destBuffer;
transitionBarrier.Transition.StateBefore = D3D12_RESOURCE_STATE_COPY_DEST;
transitionBarrier.Transition.StateAfter = D3D12_RESOURCE_STATE_COPY_DEST;
transitionBarrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
g_CommandList->ResourceBarrier(1, &transitionBarrier);

Bad luck again. This time the Debug Layer complains that the "before" and "after" states must be different:

D3D12 ERROR: ID3D12CommandList::ResourceBarrier: Before and after states must be different. [ RESOURCE_MANIPULATION ERROR #525: RESOURCE_BARRIER_MATCHING_STATES]

Bonus scenario 5: ClearRenderTargetView, followed by a draw call that renders to the same texture as a Render Target View. The texture needs to be in D3D12_RESOURCE_STATE_RENDER_TARGET for both operations. We don't put a barrier in between them and don't even have a way to do it, just like in scenario 4. So Clear operations must also guarantee the order of their execution, although I can't find anything about it in the DX12 spec.

What a mess! It seems that Direct3D 12 requires putting explicit barriers between our commands sometimes, automatically synchronizes some others, and doesn't even describe it all clearly in the documentation. The only general rule I can think of is that it cannot track resources bound through descriptors (like SRV, UAV), but tracks those that are bound in a more direct way (as render target, depth-stencil, clear target, copy destination) and synchronizes them automatically. I hope this post helped to clarify some situations that may happen in your rendering code.
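The five scenarios can be condensed into a small lookup. This is only a sketch of the observations in this post, not documented API behavior: Usage, BarrierKind and BarrierBetween are hypothetical names, and the model deliberately ignores implicit state promotion and decay.

```cpp
// Hypothetical simplification of the resource states used in the scenarios.
enum class Usage { RenderTarget, PixelShaderResource, UnorderedAccess, CopyDest };

// Which explicit barrier the application records between two uses.
enum class BarrierKind { None, Transition, Uav };

// Encodes the scenarios from this post for two consecutive uses of one resource.
inline BarrierKind BarrierBetween(Usage first, Usage second) {
    if (first != second) {
        return BarrierKind::Transition;  // scenario 1: the state changes
    }
    if (first == Usage::UnorderedAccess) {
        return BarrierKind::Uav;  // scenario 2: only if the second dispatch must wait
    }
    // Scenarios 3, 4 and 5: same state, ordering observed to be automatic.
    // ClearRenderTargetView counts as a RenderTarget use here.
    return BarrierKind::None;
}
```

For example, BarrierBetween(Usage::CopyDest, Usage::CopyDest) yields BarrierKind::None, matching scenario 4, where no barrier is needed and none can even be recorded.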
