Hide contents

Metal by Tutorials

Section I: Beginning Metal

Section 1: 10 chapters
Show chapters Hide chapters

Section II: Intermediate Metal

Section 2: 8 chapters
Show chapters Hide chapters

Section III: Advanced Metal

Section 3: 8 chapters
Show chapters Hide chapters

16 GPU Compute Programming
Written by Caroline Begbie & Marius Horga

Heads up... You're reading this book for free, with parts of this chapter shown beyond this point as scrambled text.

General Purpose GPU (GPGPU) programming uses the many-core GPU architecture to speed up parallel computation. Data-parallel compute processing is useful when you have large chunks of data and need to perform the same operation on each chunk. Examples include machine learning, scientific simulations, ray tracing and image/video processing.

In this chapter, you’ll perform some simple GPU programming and explore how to use the GPU in ways other than vertex rendering.

The Starter Project

➤ Open Xcode and build and run this chapter’s starter project. The scene contains a lonely warrior. The renderer is the forward renderer using your Phong shader.

The starter project
The starter project

From this render, you might think that the warrior is left-handed. Depending on how you render him, he can be ambidextrous.

➤ Press 1 on your keyboard.

The view changes to the front view. However, the warrior faces towards positive z instead of toward the camera.

Facing backwards
Facing backwards

The way the warrior renders is due to both math and file formats. In Chapter 6, “Coordinate Spaces”, you learned that this book uses a left-handed coordinate system. Blender exports the obj file for use in a right-handed coordinate system.

If you want a right-handed warrior, there are a few ways to solve this issue:

  1. Rewrite all of your coordinate positioning.
  2. In vertex_main, invert position.z when rendering the model.
  3. On loading the model, invert position.z.

If all of your models are reversed, option #1 or #2 might be good. However, if you only need some models reversed, option #3 is the way to go. All you need is a fast parallel operation. Thankfully, one is available to you using the GPU.

Note: Ideally, you would convert the model as part of your model pipeline rather than in your final app. After flipping the vertices, you can write the model out to a new file.

Winding Order and Culling

Inverting the z position will flip the winding order of vertices, so you may need to consider this. When Model I/O reads in the model, the vertices are in clockwise winding order.

Rendering with incorrect winding order
Wubjokaqk jojv evseqvanb tozmiwr epkif

Reversing the Model on the CPU

Before working out the parallel algorithm for the GPU, you’ll first explore how to reverse the warrior on the CPU. You’ll compare the performance with the GPU result. In the process, you’ll learn how to access and change Swift data buffer contents with pointers.

struct VertexLayout {
  vector_float3 position;
  vector_float3 normal;
mutating func convertMesh(_ model: Model) {
  let startTime = CFAbsoluteTimeGetCurrent()
  for mesh in model.meshes {
    // 1
    let vertexBuffer = mesh.vertexBuffers[VertexBuffer.index]
    let count =
      vertexBuffer.length / MemoryLayout<VertexLayout>.stride
    // 2
    var pointer = vertexBuffer
      .bindMemory(to: VertexLayout.self, capacity: count)
    // 3
    for _ in 0..<count {
      // 4
      pointer.pointee.position.z = -pointer.pointee.position.z
      // 5
      pointer = pointer.advanced(by: 1)
  // 6
  print("CPU Time:", CFAbsoluteTimeGetCurrent() - startTime)
A right-handed warrior
U jumvn-yabjob nempuul

Compute Processing

In many ways, compute processing is similar to the render pipeline. You set up a command queue and a command buffer. In place of the render command encoder, compute uses a compute command encoder. Instead of using vertex or fragment functions in a compute pass, you use a kernel function. Threads are the input to the kernel function, and the kernel function operates on each thread.

Threads and Threadgroups

To determine how many times you want the kernel function to run, you need to know the size of the array, texture or volume you want to process. This size is the grid and consists of threads organized into threadgroups.

Threads and threadgroups
Xcqoint enk bpguaydjeaks

let threadsPerGrid = MTLSize(width: 512, height: 384, depth: 1)
let width = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(
  width: width,
  height: pipelineState.maxTotalThreadsPerThreadgroup / width,
  depth: 1)
  threadsPerThreadgroup: threadsPerThreadgroup)

Non-uniform Threadgroups

The threads and threadgroups work out evenly across the grid in the previous image example. However, if the grid size isn’t a multiple of the threadgroup size, Metal provides non-uniform threadgroups.

Non-uniform threadgroups
Buz-exipujr xmkiutvxoafc

Threadgroups per Grid

You can choose how you split up the grid. Threadgroups have the advantage of executing a group of threads together and also sharing a small chunk of memory. It’s common to organize threads into threadgroups to work on smaller parts of the problem independently from other threadgroups.

Threadgroups in a 2D grid
Syzeofmloeky es e 5L fhop

let width = 32
let height = 16
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
let gridWidth = 512
let gridHeight = 384
let threadGroupCount = MTLSize(
  width: (gridWidth + width - 1) / width,
  height: (gridHeight + height - 1) / height,
  depth: 1)
  threadsPerThreadgroup: threadsPerThreadgroup)
Underutilized threads
Uxyucupowidaz kysoamx

Reversing the Warrior Using GPU Compute Processing

The previous example was a two-dimensional image, but you can create grids in one, two or three dimensions. The warrior problem acts on an array in a buffer and will require a one-dimensional grid.

func convertMesh() {
// 1
  guard let commandBuffer =
    let computeEncoder = commandBuffer.makeComputeCommandEncoder()
      else { return }
  // 2
  let startTime = CFAbsoluteTimeGetCurrent()
  // 3
  let pipelineState: MTLComputePipelineState
  do {
    // 4
    guard let kernelFunction =
      Renderer.library.makeFunction(name: "convert_mesh") else {
        fatalError("Failed to create kernel function")
    // 5
    pipelineState = try
        function: kernelFunction)
  } catch {
for mesh in meshes {
  let vertexBuffer = mesh.vertexBuffers[VertexBuffer.index]
  computeEncoder.setBuffer(vertexBuffer, offset: 0, index: 0)
  let vertexCount = vertexBuffer.length /

Setting up Threadgroups

➤ At the bottom and within the for loop closure, continue with:

let threadsPerGroup = MTLSize(
  width: pipelineState.threadExecutionWidth,
  height: 1,
  depth: 1)
let threadsPerGrid = MTLSize(width: vertexCount, height: 1, depth: 1)
  threadsPerThreadgroup: threadsPerGroup)

Performing Code After Completing GPU Execution

The command buffer can execute a closure after its GPU operations have finished.

commandBuffer.addCompletedHandler { _ in
    "GPU conversion time:",
    CFAbsoluteTimeGetCurrent() - startTime)

The Kernel Function

That completes the Swift setup. You simply specify the kernel function to the pipeline state and create an encoder using that pipeline state. With that, it’s only necessary to give the thread information to the encoder. The rest of the action takes place inside the kernel function.

#import "Common.h"

kernel void convert_mesh(
  device VertexLayout *vertices [[buffer(0)]],
  uint id [[thread_position_in_grid]])
  vertices[id].position.z = -vertices[id].position.z;
A right-handed warrior
I lowdv-pupzah bokkiey

Atomic Functions

Kernel functions perform operations on individual threads. However, you may want to perform an operation that requires information from other threads. For example, you might want to find out the total number of vertices your kernel worked on.

let totalBuffer = Renderer.device.makeBuffer(
  length: MemoryLayout<Int>.stride,
  options: [])
let vertexTotal = totalBuffer?.contents().bindMemory(to: Int.self, capacity: 1)
vertexTotal?.pointee = 0
computeEncoder.setBuffer(totalBuffer, offset: 0, index: 1)
print("Total Vertices:", vertexTotal?.pointee ?? -1)
device int &vertexTotal [[buffer(1)]],

device atomic_int &vertexTotal [[buffer(1)]],
atomic_fetch_add_explicit(&vertexTotal, 1, memory_order_relaxed);

Key Points

  • GPU compute, or general purpose GPU programming, helps you perform data operations in parallel without using the more specialized rendering pipeline.
  • You can move any task that operates on multiple items independently to the GPU. Later, you’ll see that you can even move the repetitive task of rendering a scene to a compute shader.
  • GPU memory is good at simple parallel operations, and with Apple Silicon, you can keep chained operations in tile memory instead of moving them back to system memory.
  • Compute processing uses a compute pipeline with a kernel function.
  • The kernel function operates on a grid of threads organized into threadgroups. This grid can be 1D, 2D or 3D.
  • Atomic functions allow inter-thread operations.
Have a technical question? Want to report a bug? You can ask questions and report bugs to the book authors in our official book forum here.
© 2022 Kodeco Inc.

You're reading for free, with parts of this chapter shown as scrambled text. Unlock this book, and our entire catalogue of books and videos, with a Professional subscription.

Unlock Now