Reduce-then-scan GPU radix sort, implemented as a single-file header-only Vulkan library. Easily integrates into any Vulkan project without additional dependencies, making it suitable for applications such as 3D Gaussian Splatting rendering.
Note: As of January 2025, this library was competitive with CUB Reduce-then-Scan radix sort. Benchmarking in April 2026 against CUDA 13.2 and CUB v3.2.0 (which now defaults to Onesweep) shows CUB is faster by 2.1× on keys-only and 1.3× on key-value at N = 2^25. Nevertheless, it remains a practical choice for Vulkan-based applications such as 3D Gaussian Splatting.
VulkanSDK >= 1.4.328.1— download from https://vulkan.lunarg.com/slangcis included in VulkanSDK >= 1.3.296.0- Push descriptor requires VulkanSDK >= 1.4 (>= 1.4.328.1 for macOS)
cmake >= 3.24
$ cmake . -B build
$ cmake --build build --config Release -j$ ./build/Release/bench.exe <type> [output.csv] # Windows
$ ./build/bench <type> [output.csv] # Linuxtype:cpu,vulkan,cuda,fuchsia- Sweeps N from 2^18 to 2^25 (128 steps), 1 warmup + 10 timed runs each
- Outputs median GPU and CPU throughput to CSV
Plot results:
$ python tools/plot.py vulkan.csv cuda.csv --output results.pngTest environment: Windows, NVIDIA GeForce RTX 5080, CUDA 13.2, CUB v3.2.0 (Onesweep default).
Median throughput at N = 2^25 (33,554,432 elements). Ratios are relative to this library (> 1× means the competitor is faster).
| Sort type | This library (Vulkan) | Fuchsia (Vulkan) | CUB Onesweep (CUDA) |
|---|---|---|---|
| 32-bit keys only | 10.66 GItems/s | 13.58 GItems/s (1.27×) | 22.40 GItems/s (2.10×) |
| 32-bit key-value | 9.04 GItems/s | 5.02 GItems/s (0.56×) | 11.68 GItems/s (1.29×) |
Fuchsia radix sort is faster on keys-only, but 1.80× slower on key-value. Fuchsia sorts key-value pairs as a single 64-bit key, doubling memory traffic per pass, while this library sorts the two buffers independently.
This is a single header-only library. You can integrate it via CMake or by copying include/vk_radix_sort.h directly into your project.
-
Add subdirectory:
add_subdirectory(path/to/vulkan_radix_sort)
-
Link to
vk_radix_sort:# With Volk target_link_libraries(my_project PRIVATE volk::volk_headers vk_radix_sort) # With standard Vulkan loader target_link_libraries(my_project PRIVATE Vulkan::Vulkan vk_radix_sort)
Copy include/vk_radix_sort.h into your project and include it directly.
-
In exactly one source file, define
VRDX_IMPLEMENTATIONbefore includingvk_radix_sort.h.If you are using Volk, include
volk.hfirst so the library uses Volk's function pointer dispatch. Otherwise it falls back to the standard<vulkan/vulkan.h>prototypes.// With Volk #include "volk.h" #define VRDX_IMPLEMENTATION #include "vk_radix_sort.h" // With standard Vulkan headers #define VRDX_IMPLEMENTATION #include "vk_radix_sort.h"
-
Create
VkBufferfor keys and values withVK_BUFFER_USAGE_STORAGE_BUFFER_BIT. -
Create
VrdxSorter:VrdxSorter sorter = VK_NULL_HANDLE; VrdxSorterCreateInfo sorterInfo = {}; sorterInfo.physicalDevice = physicalDevice; sorterInfo.device = device; sorterInfo.pipelineCache = pipelineCache; vrdxCreateSorter(&sorterInfo, &sorter); -
Allocate a temporary storage buffer:
VrdxSorterStorageRequirements requirements; vrdxGetSorterStorageRequirements(sorter, elementCount, &requirements); // keys only vrdxGetSorterKeyValueStorageRequirements(sorter, elementCount, &requirements); // key-value VkBufferCreateInfo bufferInfo = {VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO}; bufferInfo.size = requirements.size; bufferInfo.usage = requirements.usage; // ...
-
Record sort commands.
Buffer offsets must be multiples of
minStorageBufferOffsetAlignment(usually16).The sort command binds its own pipeline, pipeline layout, and push constants — previously bound state is not preserved after the call.
Add execution barriers around the sort. Use global memory barriers rather than per-resource barriers (Vulkan synchronization examples):
- Before the sort:
COMPUTE_SHADERstage (+TRANSFERfor indirect),SHADER_READaccess (+TRANSFER_READfor indirect). - After the sort:
COMPUTE_SHADERstage,SHADER_WRITEaccess.
VkQueryPool queryPool; // VK_NULL_HANDLE, or a timestamp query pool with at least 15 entries. // Sort keys only vrdxCmdSort(commandBuffer, sorter, elementCount, keysBuffer, 0, storageBuffer, 0, queryPool, 0); // Sort keys with values vrdxCmdSortKeyValue(commandBuffer, sorter, elementCount, keysBuffer, 0, valuesBuffer, 0, storageBuffer, 0, queryPool, 0); // Sort with indirect element count (read from GPU buffer) // Actual count in indirectBuffer must not exceed maxElementCount. vrdxCmdSortKeyValueIndirect(commandBuffer, sorter, maxElementCount, indirectBuffer, 0, keysBuffer, 0, valuesBuffer, 0, storageBuffer, 0, queryPool, 0);
- Before the sort:
After modifying shaders, run the cmake build. It runs slangc, generates src/generated/*.h, then regenerates include/vk_radix_sort.h from src/vk_radix_sort.h.in via tools/generate_header.py.
- Compare with VkRadixSort.
- Find optimal
WORKGROUP_SIZEandPARTITION_DIVISIONfor different devices.
- https://github.com/b0nes164/GPUSorting — CUDA kernel references for understanding the algorithm.
NVIDIA GPU (Windows): slow performance after a few seconds.
- Cause: NVIDIA driver downgrades GPU/memory clock under sustained load. Check with the Performance Overlay (Alt+R).
- Fix: set performance mode to maximum in NVIDIA Control Panel.

