Overview
- Motivation
- Metal API Usage
- Adding the Extension to DuckDB
- DuckDB Execution Model
Motivation
To learn basic GPU programming, DuckDB extensions, and the DuckDB execution model.
I'm using Metal on a Mac since it's my local machine (I'd like to understand how my computer's GPU APIs work). A CUDA version is a next step.
The entire post works backwards from a simple sum over a long column that runs on the GPU:
SELECT sum_gpu(long_col) FROM long_col_tbl;
Initial Metal API Usage
Before integrating into DuckDB as an extension, we write a first pass: a C++ main() function that does a simple array add.
In this section, we target the following example execution/test:
int length = 1024;
std::vector<float> A(length, 1.5f);
std::vector<float> B(length, 2.5f);
std::vector<float> C(length, 0.0f);
...
addArrays(device, commandQueue, library, A, B, C);
for (int i = 0; i < length; ++i) {
    EXPECT_NEAR(C[i], 4.0f, 1e-5) << "Mismatch at index " << i;
}
where addArrays dispatches the add_arrays kernel defined in the following Metal shader:
#include <metal_stdlib>
kernel void add_arrays(const device float* inA [[ buffer(0) ]],
                       const device float* inB [[ buffer(1) ]],
                       device float* out [[ buffer(2) ]],
                       uint id [[ thread_position_in_grid ]]) {
    // Each GPU thread adds one pair of elements, indexed by its position in the grid.
    out[id] = inA[id] + inB[id];
}
To call Metal from our C++ code, we use metal-cpp, which wraps Metal's Objective-C headers in a C++ interface.
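metal-cpp is header-only, but it needs its implementation generated in exactly one translation unit by defining the *_PRIVATE_IMPLEMENTATION macros before the includes. Depending on how metal-cpp is vendored this may already be handled for you (e.g. by the metal-cmake library definition below); a minimal sketch of that file, with an illustrative file name:
// metal_impl.cpp: the single translation unit that generates metal-cpp's implementation.
// These defines must appear before the includes, and in exactly one .cpp file.
#define NS_PRIVATE_IMPLEMENTATION
#define CA_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Foundation/Foundation.hpp>
#include <QuartzCore/QuartzCore.hpp>
#include <Metal/Metal.hpp>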
We add metal-cpp to our project directory (helpful reference) and modify our project's CMakeLists.txt:
if(APPLE)
find_library(METAL_FRAMEWORK Metal)
find_library(FOUNDATION_FRAMEWORK Foundation)
endif()
add_subdirectory(metal-cmake) # Library definition
set(METAL_SHADER ${CMAKE_CURRENT_SOURCE_DIR}/shaders/add_arrays.metal)
set(METAL_AIR ${CMAKE_CURRENT_BINARY_DIR}/add_arrays.air)
set(METAL_LIB ${CMAKE_CURRENT_BINARY_DIR}/add_arrays.metallib)
# Custom commands to compile the Metal shader.
add_custom_command(
OUTPUT ${METAL_LIB}
COMMAND xcrun -sdk macosx metal ${METAL_SHADER} -c -o ${METAL_AIR}
COMMAND xcrun -sdk macosx metallib ${METAL_AIR} -o ${METAL_LIB}
DEPENDS ${METAL_SHADER}
COMMENT "Compiling Metal shader..."
)
add_custom_target(CompileMetalShaders ALL DEPENDS ${METAL_LIB})
add_dependencies(GpuAddition CompileMetalShaders)
# Copy the metallib to the executable directory so it can be loaded at runtime.
add_custom_command(TARGET GpuAddition POST_BUILD
COMMAND ${CMAKE_COMMAND} -E make_directory $<TARGET_FILE_DIR:GpuAddition>/shaders
COMMAND ${CMAKE_COMMAND} -E copy ${METAL_LIB} $<TARGET_FILE_DIR:GpuAddition>/shaders/add_arrays.metallib
COMMENT "Copying metallib to executable directory..."
)
Our compiled Metal shader is now packaged in our binary's build directory, letting us load the shader at runtime for execution.
Calling addArrays() requires a few Metal API calls first: MTL::CreateSystemDefaultDevice() (from the metal-cpp headers) to obtain the GPU device, MTL::Device::newCommandQueue() (we submit work to the GPU via this queue), and loading the compiled shader library itself.
MTL::Device* device = initializeDevice();
if (!device) {
    return -1;
}
MTL::CommandQueue* commandQueue = device->newCommandQueue();

const int length = 1024;
std::vector<float> A(length, 1.0f);
std::vector<float> B(length, 2.0f);
std::vector<float> C(length, 0.0f);

// Load the compiled metallib that the build step copied next to the binary.
const std::string libPath = "shaders/add_arrays.metallib";
MTL::Library* library = loadMetalLibrary(device, libPath);
if (!library) {
    return -1;
}
addArrays(device, commandQueue, library, A, B, C);
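The initializeDevice() and loadMetalLibrary() helpers are our own small wrappers rather than metal-cpp APIs, and aren't shown in full here; a minimal sketch of what they might look like, assuming the metallib path produced by the CMake step:
MTL::Device* initializeDevice() {
    // Ask Metal for the default system GPU.
    MTL::Device* device = MTL::CreateSystemDefaultDevice();
    if (!device) {
        std::cerr << "Failed to create Metal device." << std::endl;
    }
    return device;
}

MTL::Library* loadMetalLibrary(MTL::Device* device, const std::string& path) {
    // Load the precompiled .metallib from disk.
    NS::Error* error = nullptr;
    NS::String* nsPath = NS::String::string(path.c_str(), NS::ASCIIStringEncoding);
    MTL::Library* library = device->newLibrary(nsPath, &error);
    if (!library) {
        std::cerr << "Failed to load Metal library at '" << path << "'." << std::endl;
    }
    return library;
}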
Now, we define the C++ function that actually sets up the array computation on the GPU:
void addArrays(MTL::Device* device, MTL::CommandQueue* commandQueue, MTL::Library* library,
               const std::vector<float>& A, const std::vector<float>& B, std::vector<float>& C) {
    const int length = static_cast<int>(A.size());
    size_t bufferSize = length * sizeof(float);

    // Allocate shared-storage buffers visible to both CPU and GPU; A and B are copied in,
    // while the output buffer is allocated and later filled by the kernel.
    MTL::Buffer* bufferA = device->newBuffer(A.data(), bufferSize, MTL::ResourceStorageModeShared);
    MTL::Buffer* bufferB = device->newBuffer(B.data(), bufferSize, MTL::ResourceStorageModeShared);
    MTL::Buffer* bufferC = device->newBuffer(bufferSize, MTL::ResourceStorageModeShared);

    // Look up the add_arrays kernel in the loaded metallib and build a compute pipeline for it.
    NS::String* fname = NS::String::string("add_arrays", NS::ASCIIStringEncoding);
    NS::Error* pError = nullptr;
    MTL::Function* addArrayFunction = library->newFunction(fname, nullptr, &pError);
    if (!addArrayFunction) {
        std::cerr << "Failed to find function 'add_arrays'." << std::endl;
        return;
    }
    MTL::ComputePipelineState* pipelineState = device->newComputePipelineState(addArrayFunction, &pError);
    if (!pipelineState) {
        std::cerr << "Failed to create compute pipeline state." << std::endl;
        return;
    }

    // Encode the compute pass: bind the pipeline and the three buffers to the
    // [[ buffer(n) ]] slots declared by the shader.
    MTL::CommandBuffer* commandBuffer = commandQueue->commandBuffer();
    MTL::ComputeCommandEncoder* computeEncoder = commandBuffer->computeCommandEncoder();
    computeEncoder->setComputePipelineState(pipelineState);
    computeEncoder->setBuffer(bufferA, 0, 0);
    computeEncoder->setBuffer(bufferB, 0, 1);
    computeEncoder->setBuffer(bufferC, 0, 2);

    // Launch one thread per element; clamp the threadgroup size to the array length.
    MTL::Size gridSize(length, 1, 1);
    unsigned threadGroupSize = pipelineState->maxTotalThreadsPerThreadgroup();
    if (threadGroupSize > static_cast<unsigned>(length))
        threadGroupSize = length;
    MTL::Size threadsPerThreadgroup(threadGroupSize, 1, 1);
    computeEncoder->dispatchThreads(gridSize, threadsPerThreadgroup);
    computeEncoder->endEncoding();

    // Submit to the GPU, block until the work finishes, then copy the result back out.
    commandBuffer->commit();
    commandBuffer->waitUntilCompleted();
    std::memcpy(C.data(), bufferC->contents(), bufferSize);
}
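One caveat worth noting: metal-cpp follows Cocoa-style manual reference counting, so the objects we create through new*() calls (the buffers, the function, and the pipeline state) are owned by us. A sketch of the cleanup we would add at the end of addArrays, once the command buffer has completed:
// Objects returned by new*() are owned by the caller in metal-cpp;
// release them after waitUntilCompleted() and the copy back into C.
bufferA->release();
bufferB->release();
bufferC->release();
addArrayFunction->release();
pipelineState->release();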
Running our tests (cmake --build . --target run_tests) gives:
Test project GpuAddition/build
Start 1: GpuAdditionTest.InitializeDeviceReturnsValidDevice
1/3 Test #1: GpuAdditionTest.InitializeDeviceReturnsValidDevice ... Passed 0.04 sec
Start 2: GpuAdditionTest.LoadMetalLibraryFailsWithWrongPath
2/3 Test #2: GpuAdditionTest.LoadMetalLibraryFailsWithWrongPath ... Passed 0.03 sec
Start 3: GpuAdditionTest.AddArraysCorrectlyAddsElements
3/3 Test #3: GpuAdditionTest.AddArraysCorrectlyAddsElements ....... Passed 0.04 sec
100% tests passed, 0 tests failed out of 3