The life and death of objects
This is a part 2 in a series where I explore Granite‘s Vulkan backend. See part 1 for an introduction. In this blog entry we will dive into code, and we will start with the basics. Our focus in this entry will be to discuss object lifetimes and how Granite deals with the Vulkan rule that you cannot delete objects which are in use by the GPU.
Sample code structure
I will be referring to concrete code samples from here on out. I have started a small code repository which contains all the samples. See README.md for how to build, but you won’t need to run the samples to understand where I’m going with these samples. Stepping through the debugger can be rather helpful however.
Sample 01 – Create a Vulkan device
Before we can do anything, we must create a VkDevice. This aspect of Vulkan is quite dull and full of boilerplate, as is the setup code for any graphics API. There is not a lot to cover from an API design perspective, but there are a few things to mention. The sample code for this part is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/01_device_creation.cpp
The API for this is pretty straight forward. I decided to split up how we load the Vulkan loader library, since there are two main use cases here:
- User wants Granite to load libvulkan.so/dll/dylib from standard locations and bootstrap from there.
- User wants to load an already provided function pointer to vkGetInstanceProcAddr. This is actually the common case, since GLFW loads the Vulkan loader for you dynamically and you can just use the GLFW provided glfwGetInstanceProcAddr to bootstrap yourself. The volk loader has support for this.
To create the instance and device, we need to do the usual song and dance of creating a VkInstance and VkDevice:
- Setup Vulkan debug callbacks
- Identify and enable relevant extensions
- Enable Vulkan validation layers in debug build
- Find appropriate VkQueues to cover graphics, async compute, transfer
Vulkan::Context and Vulkan::Device
The Context owns the VkInstance and VkDevice, and Vulkan::Device borrows a VkDevice and manages the big objects which are created from a VkDevice. It is possible to have multiple Vulkan::Device on top of a VkDevice, but we end up sharing the VkQueues and the global heaps for that device, which is a very nice property of Vulkan, since it allows frontend/backend systems like e.g. RetroArch/libretro to share a VkDevice without having hidden global state leak between the API boundary, which is a huge problem with the legacy APIs like GL and D3D11.
Note that this sample, and all other samples in this chapter are completely headless. There is no WSI involved. Vulkan is really nice in that we don’t need to create window system contexts to do any GPU work.
02 – Creating objects
Creating new resources in a graphics API should be very easy, and here I spent a lot of time on convenience. Creating images and uploading data to them in raw Vulkan is a lot of work, since there are so many things you have to think about. Creating staging buffers, copy those, defer deletion of that staging buffer, maybe we copy on the transfer queue, or not? Emit some semaphores to transfer ownership to graphics queue, creating image views, and just so many things which is very painful to write. Just creating an image in a solid way is several hundred lines of code. Fortunately, this kind of code is very easy to wrap in an API. See sample: https://github.com/Themaister/Granite-MicroSamples/blob/master/02_object_creation.cpp, where we create a buffer and image. I think the API is about as simple as you can make it while keeping a reasonable amount of flexibility.
When we allocate resources, we allocate it from Granite’s heap allocator for Vulkan. If I had done Granite today, I would just use AMD’s Vulkan Memory Allocator, but it did not exist at the time I designed my allocator, and I’m pretty happy with my design as it stands. Maybe if I need de-fragmentation in the future or some really complex memory management strategy, I’ll have to rethink and use a mature library.
To get a gist of the algorithms, Granite will allocate 64 MB chunks, which are split in 32 chunks. Those 32 chunks can then be subdivided into 32 smaller chunks, etc, all the way down to 256 bytes little chunks. I made a cute little algorithm to allocate effectively from these blocks with CTZ operations and friends. Classic buddy allocator, but you have 32 buddies.
There are also dedicated allocations. I use VK_KHR_dedicated_allocation to query if an image should be allocated with a separate vkAllocateMemory rather than being allocated from the heap. This is generally useful when allocating large frame buffers on certain architectures. Also, for allocations which exceed 64 MB, dedicated allocations are used.
A nice abstraction I made is that rather than dealing with memory types like DEVICE_LOCAL, HOST_VISIBLE, and the combination of all the possible types, I declare up-front where I like my buffers and images to reside. For buffers, there are 4 use cases:
- Vulkan::BufferDomain::Device – Must reside on DEVICE_LOCAL_BIT memory. May or may not be host visible (discrete vs integrated GPUs).
- Vulkan::BufferDomain::Host – Must be HOST_VISIBLE, prefer not CACHED. This for uploads to GPU.
- Vulkan::BufferDomain::CachedHost – Must be HOST_VISIBLE and CACHED. Falls back to non-cached, but should never happen. Might not be COHERENT. Used for readbacks from GPU.
- Vulkan::BufferDomain::LinkedDeviceHost – HOST_VISIBLE and DEVICE_LOCAL. This maps to AMD’s pinned PCI mapping, which is restricted to 256 MB. I don’t think I’ve ever actually used it, but it’s a niche option if I ever need it.
When uploading initial data to a buffer, and Device is used, we can take advantage of integrated GPUs which share memory with the CPU. In this case, we can avoid any staging buffer, and just memcpy data straight into the new DEVICE_LOCAL memory. Don’t just blindly use staging buffers when you don’t need it. Integrated GPUs will generally have DEVICE_LOCAL and HOST_VISIBLE memory types.
Mapping host memory
While not present in the sample, it makes sense to discuss how we map Vulkan memory to the CPU. A good rule of thumb in general is to keep host memory persistently mapped. vkMapMemory and vkUnmapMemory is quite expensive, especially on mobile, and we can only have one mapping of a VkDeviceMemory (64 MB with tons of suballocations!) active at any time. Rather than Map/Unmap all the time, we implement map/unmap in Vulkan::Device, by checking if we need to perform cache maintenance, with no extra CPU cost. On map() for example, we need to call vkInvalidateMappedRanges if the memory type is not COHERENT, and for unmap, we call vkFlushMappedRanges if the memory is not COHERENT. This is fairly common on mobile when doing readbacks from GPU, since we need CACHED, but we might not get COHERENT. Granite’s backend abstracts all of this away.
Physical and transient image memory
A very powerful feature of Vulkan is the support for TRANSIENT images. These images do not have to be backed by physical memory in Vulkan, and is very nice on tile-based mobile renderers.
In Granite I fully support transient images where I can pass in two different domains for images, Physical and Transient. Since Transient images are generally used for throw-away scenarios, there is a convenient method in Vulkan::Device::get_transient_attachment() to simply request a transient image with a format and resolution for rendering. Transient images are generally never created manually since they are so easy to manage internally.
There are many ways to abstract handle types in general, but I went for my own “smart pointer” variant, the trusty intrusive ref-counted pointer. It can basically be thought of a std::shared_ptr, but simpler, and we can pool the allocations of handles very nicely. How we design these handle types are not really important for Vulkan though, but I figured this point would generate some questions, so I’m addressing it here. See https://github.com/Themaister/Granite/blob/master/util/intrusive.hpp for details.
03 – Deferring deletions of GPU resources
Now we’re getting into topics where there can be significant design differences between Vulkan backends. My design philosophy for a middle-level abstraction is convenient, deterministic and good enough at the cost of a theoretical optimal solution.
A common theme you’ll find in Granite is the use of RAII. Once lifetimes of objects end, we automatically clean up resources. This is nothing new to C++ programmers, but the big problem in Vulkan is we’re not managing just memory on CPU with new/delete. We actually need to carefully control when things are deleted, since the GPU might be using the resources we are freeing. The strategy here will be to defer any deletions. The sample is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/03_frame_contexts.cpp
The frame context
In order to handle object lifetimes in Granite, I have a concept of a frame context. The frame context is responsible for holding all resources which belong to a “frame” of work. Normally this corresponds to a frame of work between AcquireNextImage and QueuePresent, but it is not tightly coupled. The Device has an array of frame contexts, usually 2 of them to allow double-buffering between CPU and GPU, (and 3 on Android because TBDR GPUs are a bit more pipelined and tend to prefer a little more buffering). The frame context is basically a huge data structure which holds data like:
- Which VkFences must be waited on to make sure that all GPU work associated with this queue is done. This is the gatekeeper which holds all our recycling and deletions back.
- Command pools for each worker thread and queue types.
- VkBuffers, VkImages, etc, to be deleted once the fences signal.
- Memory allocations from heap allocator to be freed.
- … and various other resources.
Basically, we have a central place to chuck any things which need to happen “later”, when the GPU is guaranteed to be done with this frame.
As a special consideration, the big fat “make it go slow” call Device::wait_idle() will automatically clean up everything in one go since it knows at this instant the GPU is not doing anything.
Command buffer lifetime compromise
To make the frame based cleanup work in practice, we need to simplify our notion of what command buffers can do. In Vulkan, we have the flexibility to record command buffers once and reuse them at will at any time. This creates some complications. First of all, it throws the idea of a per-frame command pool out of the window. We can never reset the command pool in that case, since there will be free-floating command buffers out there which might be used later. Command pools work their best in Vulkan when you don’t allow individual command buffers to be freed.
If we have reusable command buffers, we also have the problem of object lifetimes. We end up with a painful situation where GPU resources must be retained until all command buffers which reference them are discarded. This leads to a really difficult situation where you have two options – deep reference-counting per command buffer or just pray all of this works out and make sure objects are kept alive as long as necessary. The former option is very costly and bug-prone, and the latter is juggling with razor blades too much for my taste where a large, meaningless burden is placed on the user.
I generally don’t think reusable command buffers are a worthwhile idea, at least not for interactive applications where we’re not submitting a static workload to the GPU over and over. There just aren’t many reasonable use-cases where this gives you anything meaningful. The avenues where you can submit the same calls over and over are maybe restricted to post-processing, but recording a few draw calls which render a few full-screen quads (or compute dispatches for the cool kids) is not exactly where your draw call overhead is going to matter.
I find that beginners obsess over the idea of aggressive reuse a little too much. In the end I feel it is misguided, and there are many better places to spend your time. Recording command buffers itself in Vulkan is super efficient.
My ideal use for command buffers are where command buffers are light-weight handles which all source their memory from a common command pool linearly. No reuse, so we use ONE_TIME_SUBMIT_BIT and TRANSIENT_BIT on the pool.
In Granite, I greatly simplified the idea of command buffers into transient handles which are requested, recorded and submitted. They must be recorded and submitted in the same frame context you requested the command buffer. This way we remove the whole need for keeping track of objects per-command buffers. Instead, we just tie the resource destruction to a frame context, and that’s it. No need for complicated tracking, it’s very efficient, but we risk destroying the object a little later than is theoretically optimal. This could potentially increase memory pressure in certain situations, but I think the trade-off I made is good. If needed, I can always add explicit “delete this resource now, I know it’s safe” methods, but I haven’t found any need for this. This would only be important if we are truly memory bound.
A design decision I made was that I would never want to do internal ref-counts for resources like images and buffers, and the design would be forced to not rely on fine-grained tracking which you would typically find in legacy API implementations. Any ref-counted operations should be immediately visible to API users and never be hidden behind API implementations. In fact, command buffer arguments for binding resources are plain references or pointers, not ref-counted types.
The memory pressure of very large frames
The main flaw of this design is that there might be cases where there is one spurious frame context that has extreme use of creation and deletions of resources. A prime example here is loading screens or similar. Since Vulkan resources are not freed until the frame context itself is complete, we cannot recycle memory within a frame unless we explicitly iterate the frame context with Device::next_frame_context(). This tradeoff means that the Granite backend does not have to heuristically stall the GPU in order to reclaim memory at suitable times, which adds a lot of complexity and ruins the determinism of Granite’s behavior.
… up next!
In the next episode of Granite shenanigans we will look at the shader pipeline where we discuss VkShaderModule, VkDescriptorSetLayout, VkPipelineLayout and VkPipeline objects.