Adding GPU Processing to DuckDb Part 1

2025/05/18

Overview

Motivation

To learn basic GPU programming, DuckDb extensions, and the DuckDb execution model.

Using Metals on a Mac as its my local computer (would like to understand how my computer’s GPU APIs work). A CUDA version is a next step.

The entire post is currently working backwards from a simple addition of a long column that runs on the gpu.

SELECT sum_gpu(long_col) FROM long_col_tbl;

Initial Metals API Usage

Before integrating into DuckDb as an extension, we create a first pass at a cpp main() function that does a simple array add.

In this section, we target the following example execution/test:

    int length = 1024;
    std::vector<float> A(length, 1.5f);
    std::vector<float> B(length, 2.5f);
    std::vector<float> C(length, 0.0f);
  ...
    addArrays(device, commandQueue, library, A, B, C);
    for (int i = 0; i < length; ++i) {
        EXPECT_NEAR(C[i], 4.0f, 1e-5) << "Mismatch at index " << i;
    }

where addArrays is the shader

#include <metal_stdlib>
kernel void add_arrays(const device float* inA [[ buffer(0) ]],
                       const device float* inB [[ buffer(1) ]],
                       device float* out [[ buffer(2) ]],
                       uint id [[ thread_position_in_grid ]]) {
    out[id] = inA[id] + inB[id];
}

To have our cpp code execute/reference Metal shaders, we use metal-cpp. This wraps Metal’s Objective-C headers through cpp interface/headers.

We add metal-cpp into our project directory (helpful reference), and modify our project’s CMakeLists.txt:

if(APPLE)
    find_library(METAL_FRAMEWORK Metal)
    find_library(FOUNDATION_FRAMEWORK Foundation)
endif()

add_subdirectory(metal-cmake)  # Library definition
set(METAL_SHADER ${CMAKE_CURRENT_SOURCE_DIR}/shaders/add_arrays.metal)
set(METAL_AIR ${CMAKE_CURRENT_BINARY_DIR}/add_arrays.air)
set(METAL_LIB ${CMAKE_CURRENT_BINARY_DIR}/add_arrays.metallib)

# Custom commands to compile the Metal shader.
add_custom_command(
    OUTPUT ${METAL_LIB}
    COMMAND xcrun -sdk macosx metal ${METAL_SHADER} -c -o ${METAL_AIR}
    COMMAND xcrun -sdk macosx metallib ${METAL_AIR} -o ${METAL_LIB}
    DEPENDS ${METAL_SHADER}
    COMMENT "Compiling Metal shader..."
)

add_custom_target(CompileMetalShaders ALL DEPENDS ${METAL_LIB})
add_dependencies(GpuAddition CompileMetalShaders)

# Copy the metallib to the executable directory so it can be loaded at runtime.
add_custom_command(TARGET GpuAddition POST_BUILD
    COMMAND ${CMAKE_COMMAND} -E make_directory $<TARGET_FILE_DIR:GpuAddition>/shaders
    COMMAND ${CMAKE_COMMAND} -E copy ${METAL_LIB} $<TARGET_FILE_DIR:GpuAddition>/shaders/add_arrays.metallib
    COMMENT "Copying metallib to executable directory..."
)

Our compiled metal shader is now packaged in our binarys build directory. Letting us load the shader at runtime for execution.

The usage of add_arrays() requires a call to Metal API calls to MTL::CreateSystemDefaultDevice (from metal-cpp header) and METAL::Device::CommandQueue (we submit operations to the GPU via this queue), and loading of the shader itself.

    MTL::Device* device = initializeDevice();
    if (!device) {
        return -1;
    }

    MTL::CommandQueue* commandQueue = device->newCommandQueue();
    const int length = 1024;
    std::vector<float> A(length, 1.0f);
    std::vector<float> B(length, 2.0f);
    std::vector<float> C(length, 0.0f);

    const std::string libPath = "shaders/add_arrays.metal";

    MTL::Library* library = loadMetalLibrary(device, libPath);
    if (!library) {
        return -1;
    } 
    addArrays(device, commandQueue, library, A, B, C)

Now, we define our cpp function that actually sets up the array computation to the GPU:

void addArrays(MTL::Device* device, MTL::CommandQueue* commandQueue, MTL::Library* library,
               const std::vector<float>& A, const std::vector<float>& B, std::vector<float>& C) {
    const int length = static_cast<int>(A.size());
    size_t bufferSize = length * sizeof(float);

    MTL::Buffer* bufferA = device->newBuffer(A.data(), bufferSize, MTL::ResourceStorageModeShared);
    MTL::Buffer* bufferB = device->newBuffer(B.data(), bufferSize, MTL::ResourceStorageModeShared);
    MTL::Buffer* bufferC = device->newBuffer(bufferSize, MTL::ResourceStorageModeShared);

    NS::String* fname = NS::String::string("add_arrays", NS::ASCIIStringEncoding);
    NS::Error* pError = nullptr;
    MTL::Function* addArrayFunction = library->newFunction(fname, nullptr, &pError);
    if (!addArrayFunction) {
        std::cerr << "Failed to find function 'add_arrays'." << std::endl;
        return;
    }

    MTL::ComputePipelineState* pipelineState = device->newComputePipelineState(addArrayFunction, &pError);
    if (!pipelineState) {
        std::cerr << "Failed to create compute pipeline state." << std::endl;
        return;
    }

    MTL::CommandBuffer* commandBuffer = commandQueue->commandBuffer();
    MTL::ComputeCommandEncoder* computeEncoder = commandBuffer->computeCommandEncoder();

    computeEncoder->setComputePipelineState(pipelineState);
    computeEncoder->setBuffer(bufferA, 0, 0);
    computeEncoder->setBuffer(bufferB, 0, 1);
    computeEncoder->setBuffer(bufferC, 0, 2);

    MTL::Size gridSize(length, 1, 1);
    unsigned threadGroupSize = pipelineState->maxTotalThreadsPerThreadgroup();
    if (threadGroupSize > static_cast<unsigned>(length))
        threadGroupSize = length;
    MTL::Size threadsPerThreadgroup(threadGroupSize, 1, 1);

    computeEncoder->dispatchThreads(gridSize, threadsPerThreadgroup);
    computeEncoder->endEncoding();

    commandBuffer->commit();
    commandBuffer->waitUntilCompleted();

    std::memcpy(C.data(), bufferC->contents(), bufferSize);
}

And running our test gives (cmake --build . --target run_tests):

Test project GpuAddition/build
    Start 1: GpuAdditionTest.InitializeDeviceReturnsValidDevice
1/3 Test #1: GpuAdditionTest.InitializeDeviceReturnsValidDevice ...   Passed    0.04 sec
    Start 2: GpuAdditionTest.LoadMetalLibraryFailsWithWrongPath
2/3 Test #2: GpuAdditionTest.LoadMetalLibraryFailsWithWrongPath ...   Passed    0.03 sec
    Start 3: GpuAdditionTest.AddArraysCorrectlyAddsElements
3/3 Test #3: GpuAdditionTest.AddArraysCorrectlyAddsElements .......   Passed    0.04 sec

100% tests passed, 0 tests failed out of 3