Anatomy Of The Total War Engine: Part II

Posted on August 3, 2016 by Tamas Rabel

DX12, GCN, LDS, SGPR, shader, Total War, VGPR, Warhammer, WarhammerWed, wavefront

We’re back again on this fine Warhammer Wednesday with more from Tamas Rabel, Lead Graphics Programmer on the Total War series. In last week’s post Tamas talked about the pipeline in their last game Total War: Attila. This week we’re going to take a peek inside Creative Assembly at one of the internal tools they developed to help measure frames in both DirectX®11 and DirectX 12. Tamas is also going to talk about one of the many tricks they used to optimize their shaders and why it delivered more performance on GCN hardware. Enjoy!

Measuring Performance

Before we can start working on any kind of optimization, we must have a way to measure the performance in a consistent way and then to drill down and understand the implications.

We do have a benchmark mode in the Total War games, which can reliably reproduce the same frames over and over. We also have a console with lots of dev commands, including one which captures all the device and rendering calls in a frame and measures them on both the CPU and the GPU. Then we use a small Ruby script to convert this data to the Chrome timeline format, which can be displayed in chrome://tracing.

[Figure: captured frame data displayed in chrome://tracing]

This is what a typical scene in Total War: Warhammer looks like when rendered at 4K. We use the timeline view first to find hotspots, then to validate our improvements by comparing the results against previous captures taken with our internal tools.
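The conversion step can be sketched in a few lines. The original script is written in Ruby and is not shown in this post, so the following is only a minimal Python sketch of the same idea. The capture tuples, the track names and the `to_chrome_trace` helper are all hypothetical; the only thing taken as given is the Trace Event Format that chrome://tracing reads, where a "complete" event uses `"ph": "X"` with `ts` and `dur` in microseconds.

```python
import json

# Hypothetical track ids; chrome://tracing draws one row per (pid, tid) pair.
TRACKS = {"CPU": 0, "GPU": 1}

def to_chrome_trace(capture):
    """Convert (name, track, start_us, duration_us) tuples to Trace Event Format JSON."""
    events = []
    # Metadata events ("ph": "M") give each row a readable label.
    for track_name, tid in TRACKS.items():
        events.append({"name": "thread_name", "ph": "M", "pid": 0, "tid": tid,
                       "args": {"name": track_name}})
    # Complete events ("ph": "X") carry a start timestamp and a duration.
    for name, track, start_us, duration_us in capture:
        events.append({"name": name, "ph": "X", "ts": start_us, "dur": duration_us,
                       "pid": 0, "tid": TRACKS[track]})
    return json.dumps({"traceEvents": events}, indent=2)

# Made-up measurements, purely for illustration.
capture = [
    ("shadow_pass", "GPU", 0, 1800),
    ("terrain", "GPU", 1800, 2600),
    ("submit_terrain", "CPU", 200, 150),
]
trace_json = to_chrome_trace(capture)
```

Saving `trace_json` to a file and loading it in chrome://tracing should show the CPU and GPU calls as two rows on a shared time axis.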

Shader Register Usage

One of our most important tools in optimization was the shader analyser tool in CodeXL (and formerly GPUPerfStudio). The tool gives us information about the register usage of our shaders. I will walk you through the process of optimizing the pixel shader which combines the 8 layers of a terrain tile. After running the shader through the analyser, this was our starting point:

Let me explain this a bit.

As we’re not using any additional LDS (Local Data Share) in our pixel shaders (beyond that used for our interpolants), the two resources that we care most about are SGPRs (Scalar General Purpose Registers) and VGPRs (Vector General Purpose Registers). From the view of a single thread (or a single pixel in this case) both types of register just contain a single 32-bit value. However, at the hardware level GCN works on groups of 64 threads called wavefronts. From the point of view of a wavefront, an SGPR is a single 32-bit value that is shared across all threads in the same wavefront. Conversely, a VGPR has a unique 32-bit value for each thread in the wavefront. One way to think of it is that any values that are constant across a group of 64 threads, for example the descriptors for your textures, can be stored in SGPRs while values that are (or have the potential to be) unique to each thread are stored in VGPRs.

VGPRs

From the point of view of a single thread, each of the vector registers holds a single value. 256 VGPRs can mean 64 float4, 128 float2, 256 float or any combination of these. For example, if we sample a texture, but only use its RGB components and not its alpha channel, it will take up 3 VGPRs. Let’s do some math: we want to blend 8 terrain layers. Each layer has a diffuse, a normal and a spec/gloss texture. We use all 4 channels of the diffuse texture, 2 channels of the normal texture and 2 channels of the spec/gloss texture. That’s 4+2+2 = 8 VGPRs per layer. Multiplied by 8 layers, that is 64 VGPRs. So we’ve already used up a quarter of all the available registers in the SIMD and we haven’t even started to talk about other parts of the code, blend maps, height map, etc. Some registers can be reused, but as we’ll see soon it’s not as trivial as it seems.
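The arithmetic above is small enough to check by hand, but writing it down makes it easy to try other layer counts or channel layouts. This is just a sketch of the counting in the text; the dictionary keys and variable names are made up for illustration.

```python
# Channels actually read from each texture of a terrain layer (from the text).
CHANNELS_USED = {"diffuse": 4, "normal": 2, "spec_gloss": 2}
LAYERS = 8
VGPR_BUDGET = 256  # VGPRs available to a single thread

# One VGPR per 32-bit channel that is kept around.
vgprs_per_layer = sum(CHANNELS_USED.values())
vgprs_all_layers = vgprs_per_layer * LAYERS
fraction_of_budget = vgprs_all_layers / VGPR_BUDGET

print(vgprs_per_layer, vgprs_all_layers, fraction_of_budget)
```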

The number of used registers is important, because modern hardware runs multiple wavefronts at the same time. You can think about this as processing multiple pixels at the same time. This means that one of the main limiting factors on the number of pixels we can have in flight is the number of registers the shaders require. If a single SIMD in the hardware has 256 VGPRs and a shader is using 200 of them, for example, the GPU can work on only one wavefront at a time. After the first wavefront is launched on a SIMD it leaves 56 registers unused, which is not enough to accommodate another wavefront running the same shader. If it’s using 110 VGPRs, two wavefronts can run at the same time. VGPRs are allocated in blocks of four, so each wavefront actually takes 112 registers (112+112=224, and 32 registers remain unused). If the shader uses 24 or fewer VGPRs, the hardware can run 10 wavefronts on a SIMD at the same time. 10 concurrent wavefronts is the current maximum for Fiji GPUs, such as the Radeon® Fury X. This limit is hard-wired.

Back to the terrain shader: it was using 104 VGPRs. This means at most two wavefronts can be in flight at the same time, which is 128 pixels, as we can have a maximum of 64 pixels in a single wavefront. Reducing the number of VGPRs can increase the number of wavefronts that can be active in parallel and can in some cases result in better performance.
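The occupancy rule described in this section can be written as a tiny calculator. This is a sketch under the assumptions stated in the text (256 VGPRs per SIMD, at most 10 wavefronts) plus an allocation granularity of four registers, which is what the 110 to 112 rounding in the example implies. The function name and parameters are hypothetical, and the model ignores every other occupancy limit such as SGPRs or LDS.

```python
def max_wavefronts(vgprs_used, vgpr_budget=256, granularity=4, hw_limit=10):
    """Wavefronts per SIMD that fit when VGPRs are the only limiting resource."""
    if vgprs_used <= 0:
        return hw_limit
    # Round the request up to the allocation granularity (110 -> 112).
    allocated = -(-vgprs_used // granularity) * granularity
    # Never exceed the hard-wired wavefront limit.
    return min(hw_limit, vgpr_budget // allocated)

for vgprs in (200, 110, 24, 104):
    waves = max_wavefronts(vgprs)
    print(f"{vgprs} VGPRs -> {waves} wavefront(s), {waves * 64} pixels in flight")
```

For the four register counts used in this section the model gives 1, 2, 10 and 2 wavefronts, matching the numbers quoted above.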

So why are we using that many registers, you may ask? We process the textures one by one, so there should be no need to keep all texture samples in registers all the time. Unfortunately, we don’t have control over how the shader compiler inside the driver translates DirectX Assembly to GCN ISA (which is the machine code for AMD GPUs), nor can we supply the GCN ISA directly (at least on PC). This means the shader compiler inside the driver has to deal with conflicting goals here. One goal, as we’ve seen, is keeping the register usage count as low as possible. Another goal is to hide as much latency from sampling textures as possible. Let me explain this quickly.

Hiding Latency

Accessing data from memory (from textures or buffers) can take a long time, but it can happen in parallel with both Vector and Scalar ALU execution. As long as we don’t need to use the result of the memory load, we can keep working on other instructions to give the memory subsystem enough time to fetch the data. The more computation we can put between issuing the load and using the results, the more likely it is that we won’t have to wait for the load from memory. To achieve this, the compiler attempts to issue the vector memory requests which will ultimately write to our VGPRs as early as possible. This means the VGPRs that will ultimately contain the data we read from memory are not available for use by other instructions until our memory access has completed. This helps to hide the latency of the memory request, but on the other hand it increases register pressure, which can often lead to the shader requiring more registers. At this time there is no way of controlling this behaviour directly.

The way we hinted to the compiler that it should re-use certain registers was to wrap the layer blending inside a loop and use the [loop] attribute to ensure the compiler won’t unroll it. This is the result after this quick restructuring:
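To see why keeping the loop rolled helps, it can be useful to count how many sample registers are live at once under two instruction orderings. The following is a deliberately simplified toy model, not a description of what the driver's compiler really does: it assumes 8 registers per layer (the 4+2+2 channels counted earlier), ignores all other values in the shader, and treats a "blend" as the point where a layer's samples are no longer needed. All names are hypothetical.

```python
REGS_PER_LAYER = 8  # 4 diffuse + 2 normal + 2 spec/gloss channels
LAYERS = 8

def peak_live_registers(schedule):
    """Largest number of sample registers that are live at any point."""
    live = 0
    peak = 0
    for op, _layer in schedule:
        if op == "load":
            # Destination registers are reserved as soon as the load is issued.
            live += REGS_PER_LAYER
            peak = max(peak, live)
        elif op == "blend":
            # After blending, this layer's sample registers can be reused.
            live -= REGS_PER_LAYER
    return peak

# Unrolled: every load is hoisted to the top to hide as much latency as possible.
hoisted = [("load", i) for i in range(LAYERS)] + [("blend", i) for i in range(LAYERS)]

# Rolled loop: each iteration loads one layer and blends it before the next load.
looped = [step for i in range(LAYERS) for step in (("load", i), ("blend", i))]

print(peak_live_registers(hoisted), peak_live_registers(looped))
```

In this model the hoisted schedule keeps all 64 sample registers live at once, while the looped one needs only 8 at a time. A real shader will land somewhere in between, since the compiler still overlaps work inside each iteration and many other values stay live, and the rolled loop gives up some latency hiding in exchange for the lower register count.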

[Figure: register usage for the shader after restructuring]

With this simple trick we managed to claim back 42 registers, which doubled the wavefront occupancy for this shader (from two wavefronts to four). This change translated to an almost two-times speedup in rendering tiles.

1 Comment

Thank you for that!
My terrain shader was running at 85 VGPRs and 2 wavefronts, using this got it down to 58 VGPRs and 4 wavefronts!

Not as big of a win as you guys but I’m doing some other stuff in there too. Here is my new shader code thanks to you guys….. Did I have the right idea?

float2 LocalBlend = LocalBlendLayer.Sample(samHeightmap, TexCoord2).rg;
float4 Paint = (PaintLayer.Sample(samHeightmap, TexCoord2).rgba) * LocalBlend.x;
float4 LocalPaint = (LocalPainLayer.Sample(samHeightmap, TexCoord2).rgba) * LocalBlend.y;
float3 target = Target2D.Sample(samHeightmap, TexCoord2).rgb;

float4 diffuseLayer = 0;
float3 NormalLayer = 0;

float total = Paint.r + Paint.g + Paint.b + Paint.a + LocalPaint.x + LocalPaint.g + LocalPaint.b + LocalPaint.a ;

Paint.rgba /= total;
LocalPaint.rgba /= total;

[loop]
for (uint i = 0; i < 4; i++)
{
    TileInfo info = TileInfoBuffer[i];
    tilecolor TileColor = triplanar(n, input.WSPos, info.TextureIndex, info.XScale, info.YScale, info.ZScale);
    diffuseLayer += TileColor.Color * Paint[i] + TileColor.Color * LocalPaint[i];
    NormalLayer += TileColor.Normal * Paint[i] + TileColor.Normal * LocalPaint[i];
}
