We’re back again on this fine Warhammer Wednesday with more from Tamas Rabel, Lead Graphics Programmer on the Total War series. In last week’s post Tamas talked about the pipeline in their last game, Total War: Attila. This week we’re going to take a peek inside Creative Assembly at one of the internal tools they developed to help measure frames in both DirectX® 11 and DirectX 12. Tamas is also going to talk about one of the many tricks they used to optimize their shaders and why it delivered more performance on GCN hardware. Enjoy!
Before we can start working on any kind of optimization, we must have a way to measure the performance in a consistent way and then to drill down and understand the implications.
We do have a benchmark mode in the Total War games, which can reliably reproduce the same frames over and over. We also have a console with lots of dev commands, including one which captures all the device and rendering calls in a frame and measures them on both the CPU and the GPU.
Then we use a small Ruby script to convert this data to the Chrome timeline format, which can be displayed in chrome://tracing.
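A conversion of this kind is small. The sketch below is in Python rather than Ruby and assumes a hypothetical capture layout: one record per call, with a name, a start time and a duration in microseconds, and whether it was measured on the CPU or the GPU. The input field names are illustrative, not Creative Assembly's. The output uses the "complete event" shape (`"ph": "X"`) of the Trace Event Format that chrome://tracing loads.

```python
import json

def to_chrome_trace(records):
    """Convert capture records into a Chrome Trace Event Format document."""
    # One timeline row (tid) per device so CPU and GPU timings sit side by side.
    rows = {"cpu": 1, "gpu": 2}
    events = []
    for rec in records:
        events.append({
            "name": rec["name"],
            "ph": "X",                  # complete event: start time plus duration
            "ts": rec["start_us"],      # timestamps are in microseconds
            "dur": rec["duration_us"],
            "pid": 1,
            "tid": rows[rec["device"]],
        })
    return {"traceEvents": events}

# Hypothetical capture of one call, measured on both the CPU and the GPU.
capture = [
    {"name": "DrawTerrain", "device": "cpu", "start_us": 0, "duration_us": 40},
    {"name": "DrawTerrain", "device": "gpu", "start_us": 120, "duration_us": 900},
]
trace_json = json.dumps(to_chrome_trace(capture), indent=2)
```

Writing `trace_json` to a file and loading that file in chrome://tracing should show the two calls on separate rows.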
This is what a typical scene in Total War: Warhammer looks like rendered at 4K. We use the timeline view first to find hotspots, then to validate our improvements by comparing results against previous captures we take with our internal tools.
One of our most important optimization tools was the shader analyser in CodeXL (and formerly GPU PerfStudio). The tool gives us information about register usage for the shaders used. I will walk you through the process of optimizing the pixel shader which combines the 8 layers of a terrain tile. After running the shader through the analyser, this was our starting point:
Let me explain this a bit.
As we’re not using any additional LDS (Local Data Share) in our pixel shaders (beyond that used for our interpolants), the two resources that we care most about are SGPRs (Scalar General Purpose Registers) and VGPRs (Vector General Purpose Registers). From the view of a single thread (or a single pixel in this case) both types of register just contain a single 32-bit value. However, at the hardware level GCN works on groups of 64 threads called wavefronts. From the point of view of a wavefront, an SGPR is a single 32-bit value that is shared across all threads in the same wavefront. Conversely, a VGPR has a unique 32-bit value for each thread in the wavefront. One way to think of it is that any values that are constant across a group of 64 threads, for example the descriptors for your textures, can be stored in SGPRs, while values that are (or have the potential to be) unique to each thread are stored in VGPRs.
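To make the difference in cost concrete, the sketch below counts the storage one wavefront needs for each kind of register, using the 64-thread wavefront and 32-bit register size described above. The function names are illustrative only.

```python
WAVEFRONT_SIZE = 64   # threads per wavefront on GCN
REGISTER_BYTES = 4    # each register holds one 32-bit value

def sgpr_bytes(count):
    """Scalar registers: one 32-bit value shared by the whole wavefront."""
    return count * REGISTER_BYTES

def vgpr_bytes(count):
    """Vector registers: one 32-bit value for every thread in the wavefront."""
    return count * REGISTER_BYTES * WAVEFRONT_SIZE

print(sgpr_bytes(1))    # storage for one SGPR
print(vgpr_bytes(1))    # storage for one VGPR across all 64 threads
print(vgpr_bytes(256))  # storage for a full budget of 256 VGPRs
```

A VGPR costs 64 times the storage of an SGPR, which is why values that are uniform across the wavefront are better kept in scalar registers.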
From the point of view of a single thread, each of the vector registers holds a single value. 256 VGPRs can mean 64 float4, 128 float2, 256 float or any combination of these. For example, if we sample a texture but only use its RGB components and not its alpha channel, it will take up 3 VGPRs. Let’s do some math: we want to blend 8 terrain layers. Each layer has a diffuse, a normal and a spec/gloss texture. We use all 4 channels of the diffuse texture, 2 channels of the normal texture and 2 channels of the spec/gloss texture. That’s 4+2+2 = 8 VGPRs per layer. Multiplied by 8 layers, that is 64 VGPRs. So we’ve already used up a quarter of all the available registers in the SIMD and we haven’t even started to talk about other parts of the code, blend maps, height map, etc. Some registers can be reused, but as we’ll see soon it’s not as trivial as it seems.
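The same arithmetic, written out as a small sketch. The channel counts and the 256-register budget come from the paragraph above; the variable names are illustrative.

```python
VGPRS_AVAILABLE = 256  # per-thread VGPR budget quoted above

# Channels actually used from each texture of one terrain layer.
channels_per_layer = {"diffuse": 4, "normal": 2, "spec_gloss": 2}
layers = 8

vgprs_per_layer = sum(channels_per_layer.values())
vgprs_for_samples = vgprs_per_layer * layers
fraction_used = vgprs_for_samples / VGPRS_AVAILABLE

print(vgprs_per_layer, vgprs_for_samples, fraction_used)
```

This counts only the texture samples held live at once; any registers used for blend maps, the height map and intermediate results come on top.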
The number of registers used is important, because modern hardware runs multiple wavefronts at the same time. You can think about this as processing multiple pixels at the same time. This means that one of the main limiting factors on the number of pixels we can have in flight is the number of registers the shaders require. If a single SIMD in the hardware has 256 VGPRs and a shader is using 200 of them, for example, the GPU can work on only one wavefront at a time. After the first wavefront is launched on a SIMD it leaves 56 registers unused, which is not enough to accommodate another wavefront running the same shader. If it’s using 110 VGPRs, two wavefronts can run at the same time (VGPRs are allocated in blocks of four, so each wavefront takes 112; 112+112=224 and 32 registers remain unused). If the shader uses 24 VGPRs or fewer, the hardware can run 10 wavefronts on a SIMD at the same time. 10 concurrent wavefronts is the current maximum for Fiji GPUs, such as the Radeon® Fury X. This limit is hard wired.
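The rule described above can be sketched as a small function. It considers VGPRs only (SGPR and LDS limits are ignored) and assumes that VGPRs are allocated in blocks of four, which is what the 112+112 figure in the text implies. The constant and function names are illustrative.

```python
VGPRS_PER_SIMD = 256      # VGPR budget per SIMD, as quoted above
MAX_WAVES_PER_SIMD = 10   # hard-wired limit mentioned for Fiji
VGPR_GRANULARITY = 4      # assumption: VGPRs are allocated in blocks of 4

def waves_per_simd(vgprs_used):
    """Wavefronts that fit on one SIMD when VGPRs are the only limit."""
    # Round the shader's VGPR count up to the allocation granularity.
    blocks = -(-vgprs_used // VGPR_GRANULARITY)
    allocated = max(blocks, 1) * VGPR_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // allocated)

for vgprs in (200, 110, 24):
    print(vgprs, "VGPRs ->", waves_per_simd(vgprs), "wavefront(s)")
```

With the three register counts from the text this gives one, two and ten wavefronts respectively.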
Back to the terrain shader: it was using 104 VGPRs. This means at most two wavefronts can be in flight at the same time, which is 128 pixels, as we can have a maximum of 64 pixels in a single wavefront. Reducing the number of VGPRs can increase the number of wavefronts that can be active in parallel and can in some cases result in better performance.
So why are we using that many registers, you may ask? We process the textures one by one, so there should be no need to keep all texture samples in registers all the time. Unfortunately, we don’t have control over how the shader compiler inside the driver translates DirectX Assembly to GCN ISA (which is the machine code for AMD GPUs), nor can we supply the GCN ISA directly (at least on PC). This means the shader compiler inside the driver has to deal with conflicting goals here. One goal, as we’ve seen, is keeping the register usage count as low as possible. Another goal is to hide as much latency from sampling textures as possible. Let me explain this quickly.
Accessing data from memory (from textures or buffers) can take a long time, but it can happen in parallel with both Vector and Scalar ALU execution. As long as we don’t need to use the result of the memory load, we can keep working on other instructions to give the memory subsystem enough time to fetch the data. The more computation we can put between issuing the load and using the results, the more likely it is that we won’t have to wait for the load from memory. To achieve this, the compiler attempts to issue the vector memory requests which will ultimately write to our VGPRs as early as possible. This means the VGPRs that will ultimately contain the data we read from memory are not available for use by other instructions until our memory access has completed. This helps hide the latency of the memory request, but on the other hand it increases register pressure in the shader, which can often lead to the shader requiring more registers. At this time there is no way of controlling this behaviour directly.
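The trade-off can be illustrated with a deliberately simple toy model of a single wavefront. It is not how the hardware or the compiler actually schedules instructions: the cycle counts are hypothetical, each load costs one cycle to issue, and the model ignores other wavefronts covering the stalls. It only shows that issuing all loads early shortens the timeline while keeping every sample live at once.

```python
def shader_cycles(layers, load_latency, alu_cycles, hoist_loads):
    """Toy timeline for one wavefront; returns (total cycles, peak live samples)."""
    t = 0
    if hoist_loads:
        # Issue every load up front (one cycle each), then do the ALU work.
        ready = []
        for _ in range(layers):
            ready.append(t + load_latency)
            t += 1
        for r in ready:
            t = max(t, r) + alu_cycles  # stall only if the data is not back yet
        return t, layers                # all samples are held in registers at once
    # Otherwise load, wait and blend one layer at a time.
    for _ in range(layers):
        t = t + load_latency + alu_cycles
    return t, 1                         # only one sample is held at a time

# Hypothetical numbers: 8 layers, 100-cycle loads, 20 cycles of blending each.
print(shader_cycles(8, 100, 20, hoist_loads=True))
print(shader_cycles(8, 100, 20, hoist_loads=False))
```

In this model the hoisted version finishes much sooner but needs registers for all eight layers' samples simultaneously, which is the register pressure described above.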
The way we hinted to the compiler that it should re-use certain registers was to wrap the layer blending inside a loop and use the [loop] attribute to ensure the compiler won’t unroll it. This is the result after this quick restructuring:
With this simple trick we managed to claim back 42 registers, which doubled the wavefront occupancy for this shader. This change translated to an almost two-times speedup in rendering tiles.
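To see why 42 registers were enough to double the occupancy, it helps to turn the question around and ask how many VGPRs a shader may use for a given number of wavefronts. The sketch below assumes the 256-VGPR budget quoted earlier and allocation in blocks of four; the function name is illustrative.

```python
VGPRS_PER_SIMD = 256
VGPR_GRANULARITY = 4   # assumption: VGPRs are allocated in blocks of 4

def vgpr_budget(target_waves):
    """Largest per-shader VGPR count that still allows target_waves per SIMD."""
    share = VGPRS_PER_SIMD // target_waves
    return share - share % VGPR_GRANULARITY

before, saved = 104, 42
after = before - saved

print(after)            # VGPRs used after the restructuring
print(vgpr_budget(2))   # budget for two wavefronts
print(vgpr_budget(3))   # budget for three wavefronts
print(vgpr_budget(4))   # budget for four wavefronts
```

At 104 VGPRs the shader is over the budget for three wavefronts, so it runs two. After the change it fits just inside the budget for four.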
For a quick introduction to profiling on the GPU, I recommend the following article:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/
You can find more information about the GCN architecture here:
http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf
http://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
If you’d like to see some practical examples on hiding latency and GCN ISA:
https://bartwronski.com/2014/03/27/gcn-two-ways-of-latency-hiding-and-wave-occupancy/
View the other blogs in the Warhammer Wednesday series here.
If you have questions, feel free to comment.
Thank you for that!
My terrain shader was running at 85 VGPRs and 2 wavefronts; using this got it down to 58 VGPRs and 4 wavefronts!
Not as big of a win as you guys, but I’m doing some other stuff in there too. Here is my new shader code, thanks to you guys... Did I have the right idea?
float2 LocalBlend = LocalBlendLayer.Sample(samHeightmap, TexCoord2).rg;
float4 Paint = (PaintLayer.Sample(samHeightmap, TexCoord2).rgba) * LocalBlend.x;
float4 LocalPaint = (LocalPainLayer.Sample(samHeightmap, TexCoord2).rgba) * LocalBlend.y;
float3 target = Target2D.Sample(samHeightmap, TexCoord2).rgb;
float4 diffuseLayer = 0;
float3 NormalLayer = 0;
float total = Paint.r + Paint.g + Paint.b + Paint.a + LocalPaint.x + LocalPaint.g + LocalPaint.b + LocalPaint.a ;
Paint.rgba /= total;
LocalPaint.rgba /= total;
[loop]
for (uint i = 0; i < 4; i++)
{
    TileInfo info = TileInfoBuffer[i];
    tilecolor TileColor = triplanar(n, input.WSPos, info.TextureIndex, info.XScale, info.YScale, info.ZScale);
    diffuseLayer += TileColor.Color * Paint[i] + TileColor.Color * LocalPaint[i];
    NormalLayer += TileColor.Normal * Paint[i] + TileColor.Normal * LocalPaint[i];
}