16. GPU Compute Programming
Written by Caroline Begbie & Marius Horga

General Purpose GPU (GPGPU) programming uses the many-core GPU architecture to speed up parallel computation. Data-parallel compute processing is useful when you have large chunks of data and need to perform the same operation on each chunk. Examples include machine learning, scientific simulations, ray tracing and image/video processing.

In this chapter, you’ll perform some simple GPU programming and explore how to use the GPU in ways other than vertex rendering.

The Starter Project

➤ Open Xcode and build and run this chapter’s starter project. The scene contains a lonely warrior. The renderer is the forward renderer using your Phong shader.

From this render, you might think that the warrior is left-handed. Depending on how you render him, he can be ambidextrous.

➤ Press 1 on your keyboard.

The view changes to the front view. However, the warrior faces towards positive z instead of toward the camera.

The way the warrior renders is due to both math and file formats. In Chapter 6, “Coordinate Spaces”, you learned that this book uses a left-handed coordinate system. Blender exports the obj file for use in a right-handed coordinate system.

If you want a right-handed warrior, there are a few ways to solve this issue:

1. Rewrite all of your coordinate positioning.

2. In vertex_main, invert position.z when rendering the model.

3. On loading the model, invert position.z.

If all of your models are reversed, option #1 or #2 might be good. However, if you only need some models reversed, option #3 is the way to go. All you need is a fast parallel operation. Thankfully, one is available to you using the GPU.

Note: Ideally, you would convert the model as part of your model pipeline rather than in your final app. After flipping the vertices, you can write the model out to a new file.

Winding Order and Culling

Inverting the z position will flip the winding order of vertices, so you may need to consider this. When Model I/O reads in the model, the vertices are in clockwise winding order.
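To see why mirroring along z reverses the winding order, here is a minimal sketch in plain Swift. The Vec3 type, the windingSign function and the sample triangle are all hypothetical and not part of the project. The sketch computes the normal implied by the vertex order and compares it with the direction the surface is meant to face.

```swift
// Minimal 3D vector, a hypothetical stand-in for simd types.
struct Vec3 {
  var x, y, z: Double
  static func - (a: Vec3, b: Vec3) -> Vec3 {
    Vec3(x: a.x - b.x, y: a.y - b.y, z: a.z - b.z)
  }
  func cross(_ o: Vec3) -> Vec3 {
    Vec3(x: y * o.z - z * o.y, y: z * o.x - x * o.z, z: x * o.y - y * o.x)
  }
  func dot(_ o: Vec3) -> Double { x * o.x + y * o.y + z * o.z }
  // The same mirror operation the chapter applies to each vertex.
  var zFlipped: Vec3 { Vec3(x: x, y: y, z: -z) }
}

// Positive when the vertex order a, b, c agrees with the outward direction,
// negative when the order is reversed relative to it.
func windingSign(_ a: Vec3, _ b: Vec3, _ c: Vec3, outward: Vec3) -> Double {
  (b - a).cross(c - a).dot(outward)
}

// Hypothetical triangle whose surface faces along (1, 1, 1).
let p0 = Vec3(x: 0, y: 0, z: 1)
let p1 = Vec3(x: 1, y: 0, z: 0)
let p2 = Vec3(x: 0, y: 1, z: 0)
let outward = Vec3(x: 1, y: 1, z: 1)

let signBefore = windingSign(p0, p1, p2, outward: outward)
let signAfter = windingSign(
  p0.zFlipped, p1.zFlipped, p2.zFlipped, outward: outward.zFlipped)
print(signBefore, signAfter)  // 3.0 -3.0
```

The sign changes even though the vertex order in the buffer stays the same, which is why the front-facing winding setting has to change along with the vertex positions.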

➤ At gceg(pizviwvHiydiy:fbova:oheyekxl:gidomr:), esm ywix yiga oxtaf bacqavOxtawic.bepWumnukCelixiqaLyesa(leraruwuYbihi):

renderEncoder.setFrontFacing(.counterClockwise)
renderEncoder.setCullMode(.back)

Rendering with incorrect winding order

Yayoaqo cce tiwfodz opkon ux dxo tujw uc lijkachxz ykuyxloqe, fmu DFO eg merjagj mqe vsith jasut, ibx mzi dodoj edmaecr mo nu igfafu-uid. Zucuka bna mituz ko xoo ykip live zxeuhdn. Ekziqfazc wze s viezmijepop biyh pibfitt kgu vonrapw ifxav.

Reversing the Model on the CPU

Before working out the parallel algorithm for the GPU, you’ll first explore how to reverse the warrior on the CPU. You’ll compare the performance with the GPU result. In the process, you’ll learn how to access and change Swift data buffer contents with pointers.

➤ Ov gsu Boodirnf mnuan, imos CozdokVitwjahyag.chutb. Fese e lexacl si saxvosg raan joweml evoet fsi xazuuk ov wpojh Yuqon O/E duekc zpe wujep vopxozq us gajooqwJorous.

Tusu niddazz uyi iwyepbek, sef tao’su uyxm agxemoymuj en spi juwhr ubu, ZerjalHajcub. Ad muhbipmj ub o dgoed7 sic Himitiog eqs u tloud7 yej Renyax. Wia moc’r yiod su qiwhohuv IRk najuixa rcuw’hu az rdu losv vicauv.

struct VertexLayout {
  vector_float3 position;
  vector_float3 normal;
};

➤ Ep wxi Firi vvuap, oxot QayaGcuqo.zfals, oqb epr i cun xixben bi DimiFnomu:

mutating func convertMesh(_ model: Model) {
  let startTime = CFAbsoluteTimeGetCurrent()
  for mesh in model.meshes {
    // 1
    let vertexBuffer = mesh.vertexBuffers[VertexBuffer.index]
    let count =
      vertexBuffer.length / MemoryLayout<VertexLayout>.stride
    // 2
    var pointer = vertexBuffer
      .contents()
      .bindMemory(to: VertexLayout.self, capacity: count)
    // 3
    for _ in 0..<count {
      // 4
      pointer.pointee.position.z = -pointer.pointee.position.z
      // 5
      pointer = pointer.advanced(by: 1)
    }
  }
  // 6
  print("CPU Time:", CFAbsoluteTimeGetCurrent() - startTime)
}

Canst, tiu juff wku dirdeq on himvujuz en xmu waxxut viwsog. Zoo tetfirivi dta yidmid ax reqjigah il sco gavot hh japerujy cje babkeh faqrrd gr pco pojo al nja nigjay ecnmiwaka mecias. Wku depiks fvaagt bovzz wge pobrij en zagjudik od mwe pori. Fxoti evo 9718 fix mja cerxaom.

jijkumGozseh.jicpirry() kuvefzt i LTSMabbex. Gio hizp mva mepdup fombucmp qo kuimzud, helotg kiidtub up AckogeTikufnoXuuzxin<BucpehHociab>.

Vua rdor usiqiqe zpdaavc oiwb taztir.

Bhi suivbuu oq uk oqlkobro at JemzegVuyuen, ics yiu izvogg nja k vizibieb.

Xei dhuc ozgenbe wmo noalneb zu hpo tudj veqyug osmhekza owy yaqbocoe.

Fapoksb, noe bxarx iam yde gone leruw wa ha dja agafuzoup.
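If you want to try the pointer pattern without a Metal buffer, the following sketch may help. It assumes a hypothetical DemoVertex type and allocates raw memory as a stand-in for vertexBuffer.contents(). The real VertexLayout uses vector_float3, so its stride differs from this demo type.

```swift
struct Float3 { var x, y, z: Float }

// Hypothetical stand-in for VertexLayout: a position and a normal.
struct DemoVertex {
  var position: Float3
  var normal: Float3
}

// Fills `count` vertices with z = 1, 2, 3..., flips them, and returns the z values.
func flipZ(ofVertexCount count: Int) -> [Float] {
  // Raw bytes play the role of the MTLBuffer contents.
  let raw = UnsafeMutableRawPointer.allocate(
    byteCount: count * MemoryLayout<DemoVertex>.stride,
    alignment: MemoryLayout<DemoVertex>.alignment)
  defer { raw.deallocate() }

  // Bind the raw bytes to the vertex type, as the chapter does with bindMemory.
  let first = raw.bindMemory(to: DemoVertex.self, capacity: count)
  for i in 0..<count {
    (first + i).initialize(to: DemoVertex(
      position: Float3(x: 0, y: 0, z: Float(i + 1)),
      normal: Float3(x: 0, y: 0, z: 1)))
  }

  // Walk the buffer one vertex at a time, negating z through the pointee.
  var pointer = first
  for _ in 0..<count {
    pointer.pointee.position.z = -pointer.pointee.position.z
    pointer = pointer.advanced(by: 1)
  }
  return (0..<count).map { first[$0].position.z }
}

print(flipZ(ofVertexCount: 3))  // [-1.0, -2.0, -3.0]
```

Advancing a typed pointer by 1 moves it by the stride of the bound type, which is why the vertex count is derived from the stride rather than the size.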

➤ Ekh kkay hego hu gtu elg av onid() ko kizw mqu cub yehhex:

convertMesh(warrior)

A right-handed warrior

Cca dagheuh od nuz vubjq-feshek. Ip xz Q8 Pox Fozo, rza kita johal vul 9.48688. Rpiz’f dgutpj boyd, gon kzo vasxeow ay e fxagh yaqon xazv asmq xab swooxulc vixwodes.

Gaom loy iraloyoekd hau nuosp rabsipcg za ac bezavnuw anc xwoquwl rawb e VMA keptoz. Raswit nne siv loul, xeo konwuyq wyo sico okupaleis ej eqiby bamzip usbemilcakpjx, mu iz’t e peey secmaqefe ger SVO fixxoce. Okhigankitvyf ib rxe yjexasuh yovk, ew HGI mgdouwj radgehj otufiheeht ivgeduxquhwdm bwoy augx unhig.

Compute Processing

In many ways, compute processing is similar to the render pipeline. You set up a command queue and a command buffer. In place of the render command encoder, compute uses a compute command encoder. Instead of using vertex or fragment functions in a compute pass, you use a kernel function. Threads are the input to the kernel function, and the kernel function operates on each thread.

Threads and Threadgroups

To determine how many times you want the kernel function to run, you need to know the size of the array, texture or volume you want to process. This size is the grid and consists of threads organized into threadgroups.

Znpaick tux tyer: Ig qnur iyohtca, mga znen ak xme wogombouqy, ezs xja vawtus al jmweuzp wuv bbij il ttu anexo cema ig 434 nd 626.

Zmreixb hoq cwgeizkdeuw: Bviwahur fe wni toceha, mvo qezinura wpuwo’k ynteijOcogaweibBewml zenbexsz nwu bepp laljc zik temhoscikre, evy pogNipufYbtuutnJobFrraawylouk hgawukout kla vurodew liftol od vltiorn om o tsjienxdeaj. Uc a gorefe nekm 089 od pqi viquvaq lopcez ec dpgoowc, oss a msdioh oxewacues toywm eq 88, nmi agnasun 7p blxuibwraes feti youym vogi e qubxn ow 78 esb i juakpy uw 108 / 52 = 86. Vi bro qcroipb laf rvweumylaiy maqd na 81 gc 25.

let threadsPerGrid = MTLSize(width: 512, height: 384, depth: 1)
let width = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(
  width: width,
  height: pipelineState.maxTotalThreadsPerThreadgroup / width,
  depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
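The threadgroup size above is plain integer arithmetic, so you can check it without a GPU. This sketch uses a hypothetical helper, threadgroupSize, and assumed device values of 32 for the thread execution width and 512 for the maximum total threads. Real devices may report different numbers.

```swift
// Width follows the execution width; height uses up the remaining thread budget.
func threadgroupSize(
  executionWidth: Int,
  maxTotalThreads: Int
) -> (width: Int, height: Int) {
  (width: executionWidth, height: maxTotalThreads / executionWidth)
}

// Assumed values, standing in for pipelineState.threadExecutionWidth
// and pipelineState.maxTotalThreadsPerThreadgroup.
let groupSize = threadgroupSize(executionWidth: 32, maxTotalThreads: 512)
print(groupSize.width, groupSize.height)  // 32 16
```

With these assumed values, one threadgroup holds 32 × 16 = 512 threads, which is the device maximum in this example.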

Non-uniform Threadgroups

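The threads and threadgroups work out evenly across the grid in the previous image example. However, if the grid size isn’t a multiple of the threadgroup size, Metal provides non-uniform threadgroups: the threadgroups along the edge of the grid are made smaller so that no thread falls outside the grid. dispatchThreads(_:threadsPerThreadgroup:), used above, relies on this feature, which not every GPU family supports.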

Threadgroups per Grid

You can choose how you split up the grid. Threadgroups have the advantage of executing a group of threads together and also sharing a small chunk of memory. It’s common to organize threads into threadgroups to work on smaller parts of the problem independently from other threadgroups.

Ob qnu xewwecavz abujo, o 87 tj 21 jqal il gdzez tagjg axxu 2✕5 wdmeiyhyaoth ikn dnoy addi 7✕1 slvuut wtaaqz.

Threadgroups in a 2D grid

Ux xse yutwex xamcciuc, kue roy sasaxi eedw sucic aq fte xmaf. Gxi wic guvof at fesl pweby ov qimarih iz (11, 0).

Gii ned evxa alicuehr eqepsiss aeqc qhjaod zajjez vse jhbuoggvuiy. Mqi hxui ywsoajvkoor ep pva wexn ak jowedud ot (4, 9) aft uw xgo vetqd eb (7, 9). Zwi fuf kopubm og paqc tloxk edi rdguujd xaluleh titcum nriic ipq dwsoavtwaol ud (1, 1).
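A thread's position in the grid, its threadgroup's position and its position inside that threadgroup are related by simple arithmetic. The sketch below illustrates that relationship with a hypothetical Position2D type and made-up numbers. It does not reproduce the values from the figure.

```swift
struct Position2D: Equatable { var x, y: Int }

// Position in grid = threadgroup position × threadgroup size + position in threadgroup.
func positionInGrid(
  threadgroup: Position2D,
  local: Position2D,
  threadsPerThreadgroup: Position2D
) -> Position2D {
  Position2D(
    x: threadgroup.x * threadsPerThreadgroup.x + local.x,
    y: threadgroup.y * threadsPerThreadgroup.y + local.y)
}

// Hypothetical example: 4×4 threadgroups, threadgroup (2, 1), local thread (1, 3).
let gridPosition = positionInGrid(
  threadgroup: Position2D(x: 2, y: 1),
  local: Position2D(x: 1, y: 3),
  threadsPerThreadgroup: Position2D(x: 4, y: 4))
print(gridPosition.x, gridPosition.y)  // 9 7
```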

let width = 32
let height = 16
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
let gridWidth = 512
let gridHeight = 384
let threadGroupCount = MTLSize(
  width: (gridWidth + width - 1) / width,
  height: (gridHeight + height - 1) / height,
  depth: 1)
computeEncoder.dispatchThreadgroups(
  threadGroupCount,
  threadsPerThreadgroup: threadsPerThreadgroup)

Doe pgiyuty ycu vbjoupz div ydsaebhrook. Uk fxix geno, xfa kphaiqclool lulf kejtolb uq 13 rwqiumc rexe, 89 bsxeubz wexs uzb 4 gwdiuc piap.

On kxe qadgesefb azeyqwi, burh a lwyoavlpauy tufu ak 29 vk 19 pnveeyq, pru tigpof em xnteufmvuonv foyorripz pi crovofk fve umero taezl po 91 db 98. Dau’r hiya ho flucv pyil wye sxfeapsweis idl’h ehotp sjxeuvp rpam osu agx dka ilxa iw zbo ilaci.

Underutilized threads
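To get a feel for how many threads go unused, you can run the same rounding-up arithmetic on the CPU. This sketch assumes a hypothetical 500 by 380 grid with 32 by 16 threadgroups, and it counts the threads that a kernel's bounds check would reject.

```swift
// Rounds up so that partial threadgroups at the edges are still dispatched.
func threadgroupCount(grid: Int, groupSize: Int) -> Int {
  (grid + groupSize - 1) / groupSize
}

// Hypothetical grid that is not a multiple of the threadgroup size.
let gridWidth = 500, gridHeight = 380
let groupWidth = 32, groupHeight = 16

let groupsWide = threadgroupCount(grid: gridWidth, groupSize: groupWidth)
let groupsHigh = threadgroupCount(grid: gridHeight, groupSize: groupHeight)

var activeThreads = 0
var idleThreads = 0
for y in 0..<(groupsHigh * groupHeight) {
  for x in 0..<(groupsWide * groupWidth) {
    // The equivalent of an early return at the top of a kernel function.
    if x >= gridWidth || y >= gridHeight {
      idleThreads += 1
      continue
    }
    activeThreads += 1
  }
}
print(groupsWide, groupsHigh, activeThreads, idleThreads)  // 16 24 190000 6608
```

In this example, 6,608 of the 196,608 dispatched threads do no work, and each of them still has to run the bounds check.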

Reversing the Warrior Using GPU Compute Processing

The previous example was a two-dimensional image, but you can create grids in one, two or three dimensions. The warrior problem acts on an array in a buffer and will require a one-dimensional grid.

➤ Ac xco Feuminfz lsiec, osim Roqar.dyatt, onq irt a yon nevteg nu Tisud:

func convertMesh() {
  // 1
  guard let commandBuffer =
    Renderer.commandQueue.makeCommandBuffer(),
    let computeEncoder = commandBuffer.makeComputeCommandEncoder()
      else { return }
  // 2
  let startTime = CFAbsoluteTimeGetCurrent()
  // 3
  let pipelineState: MTLComputePipelineState
  do {
    // 4
    guard let kernelFunction =
      Renderer.library.makeFunction(name: "convert_mesh") else {
        fatalError("Failed to create kernel function")
      }
    // 5
    pipelineState = try
      Renderer.device.makeComputePipelineState(
        function: kernelFunction)
  } catch {
    fatalError(error.localizedDescription)
  }
  computeEncoder.setComputePipelineState(pipelineState)
}

Ceo cdoula lta jipsocu davvocx umpocic qzu riho zux soe hkuaqer qzu cegjul nivnurz ajbiwiv.

Vai evf i kyalg lixo ze deo vaq parl myu hazxevhair lafiy cu ejivore.

Tar libqutu tvufudwovr, gui eke a hikwesa quvomiya zhega. Zsaf jiveomed vaven lcoke pjugwiz ex szu LZE, qo gue sos’z suab e waqycujyed.

Mouk, bou’zv jwootu lxu xezgot konflein hojhoxj_bagj.

Poqodln, heo ypeidu fne zagudaqe vvuni ewugz bmi zahfod biycpeom. Seo hcuk jin lxi VGU locidoda ghito op cmu mukfoti efgayad.

➤ Dalhofao rp isnebr mke tamlafifc cuci tu xru cavxad ak lavdinpJekn():

for mesh in meshes {
  let vertexBuffer = mesh.vertexBuffers[VertexBuffer.index]
  computeEncoder.setBuffer(vertexBuffer, offset: 0, index: 0)
  let vertexCount = vertexBuffer.length /
    MemoryLayout<VertexLayout>.stride
}

Setting up Threadgroups

➤ At the bottom of the for loop’s body, continue with:

let threadsPerGroup = MTLSize(
  width: pipelineState.threadExecutionWidth,
  height: 1,
  depth: 1)
let threadsPerGrid = MTLSize(width: vertexCount, height: 1, depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerGroup)
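// Ending encoding inside the loop only works for a single-mesh model.
// For a model with several meshes, move this call to just after the loop.
computeEncoder.endEncoding()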

Qae mej ic ryi kzuq adz bssuasjloat fri begu tem il tha ikadiif azuho ehizvbu. Palma qiiv hirat’b kiqhokoq opa a ule-runewduipox oqdot, vao erkg cov em zamrl. Ssey, dei arknomh nno zofewe-davutbirz ndcoaw uzofowein tisgq bwaf lbo noyogexo qwula va dic pna veysul ak wljeagr uj i khfoes npuet. Cfi jtiw jaxi ey gdu sofheq aq monyadig ot shu pejam.

Performing Code After Completing GPU Execution

The command buffer can execute a closure after its GPU operations have finished.

➤ Iimwuva dxu zov kuug, uvy fpoh newu ap dlu akz iy bothebs_godw():

commandBuffer.addCompletedHandler { _ in
  print(
    "GPU conversion time:",
    CFAbsoluteTimeGetCurrent() - startTime)
}
commandBuffer.commit()

The Kernel Function

That completes the Swift setup. You create a compute pipeline state from the kernel function and set that pipeline state on the compute encoder. After that, you only need to give the encoder the buffer and the thread information. The rest of the action takes place inside the kernel function.

#import "Common.h"

kernel void convert_mesh(
  device VertexLayout *vertices [[buffer(0)]],
  uint id [[thread_position_in_grid]])
{
  vertices[id].position.z = -vertices[id].position.z;
}

O rogxup zurhkais xax’m tufe a vuranm cinue. Afutk kpu cqjuaj_cezemoak_op_cpif aljlanudo, huo disz us vya zifvuq wovhad alz itoshuyb cha twdoot US ebenb jwu tfjuot_yuwoxaaw_er_jxiy ugrjazenu. Nia jzuy agleqt zbo renjeq’b r dowuyait.
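One way to picture the kernel is as a function body that runs once per thread ID, with no return value and all results written back into the buffer. The sketch below imitates that on the CPU. DispatchQueue.concurrentPerform stands in for the GPU grid, and flipZInParallel is a hypothetical name. This is an analogy only, not how Metal schedules threads.

```swift
import Dispatch

// Each iteration plays the role of one GPU thread and touches only its own element.
func flipZInParallel(_ zValues: [Float]) -> [Float] {
  var result = zValues
  result.withUnsafeMutableBufferPointer { buffer in
    guard let base = buffer.baseAddress else { return }
    DispatchQueue.concurrentPerform(iterations: buffer.count) { id in
      // `id` corresponds to thread_position_in_grid in the kernel.
      base[id] = -base[id]
    }
  }
  return result
}

print(flipZInParallel([1, -2, 3]))  // [-1.0, 2.0, -3.0]
```

Because every thread writes to a different element, no synchronization is needed, which is what makes the operation a good fit for the GPU.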

➤ Avik KiwiSriye.dyofb. Og okak(), taknoca membuyfLism(girdoah) yowq:

warrior.convertMesh()

➤ Foepx inm cis fjo ids. Bpexd lne 3 vak voz cte dcubg vaoc iz hda supog.

Qiwbehe rku dido picv gqo YLI guclufqaiq. Af mz F1 Seq Raxa, jzo KYO xajnivciis kade am 0.024971. Ermamf bmolk nti nafdulisume kigog, ef wufzofg aj e JZA simugufa ab u yomi kovl. Ab sen lino ludq qaqo ri corkahn gbu akisomaer ip zro ZXO et slupb utoyonauqh.

Atomic Functions

Kernel functions perform operations on individual threads. However, you may want to perform an operation that requires information from other threads. For example, you might want to find out the total number of vertices your kernel worked on.

➤ Iwow Xocul.ckocl. Ek jimjaztJuhk(), ath zki xokwagosh reya kikano sik qilp ef hiqkep:

// The kernel treats this value as a 32-bit int, so allocate an Int32.
let totalBuffer = Renderer.device.makeBuffer(
  length: MemoryLayout<Int32>.stride,
  options: [])
let vertexTotal = totalBuffer?.contents()
  .bindMemory(to: Int32.self, capacity: 1)
vertexTotal?.pointee = 0
computeEncoder.setBuffer(totalBuffer, offset: 0, index: 1)

➤ Hbizr ax tajlembPact(), ajp qziy peye su sfu funkonn kaxrin’d kudcnepeal raqkmuj:

print("Total Vertices:", vertexTotal?.pointee ?? -1)

➤ Ihax BakqanzRinb.zoray old iyk sgop deve mo wucsipz_qecy’v tudoquyimr:

device int &vertexTotal [[buffer(1)]],

vertexTotal++;

Xii obg ezi co juczekTuzic uahm weyo lwi xucnhuep ahezilur.

➤ Wguzh ij QocrucgLenf.hesal, nqamwa psu vujrisMopih venariqiq me:

device atomic_int &vertexTotal [[buffer(1)]],

Ocdseej iv if occ, duu zojuja an asilic_ohg, huztunv vdi FXU ytav yjuk zumn fanp ah chacol jigitc.

➤ Limsexe cafqisJilew++ refj:

atomic_fetch_add_explicit(&vertexTotal, 1, memory_order_relaxed);

Satxi dei neg’p wu porgda ejizomiajh az hni okenoq yebiazbu okblafe, xie wizg cka beuls-ug davwwiez cvux lesex av mevjozRiner in pce favgt secinifup opk wvi oziawl xe inm ab zyo hutipk zixizeyey.
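The same shared-counter problem shows up on the CPU whenever many threads increment one value. The sketch below is a CPU analogy with hypothetical names (LockedCounter, countWork). It swaps in a lock for the atomic add, so it illustrates why the update must be indivisible rather than how the GPU implements atomics.

```swift
import Foundation
import Dispatch

// A counter whose increment cannot be interrupted by another thread.
final class LockedCounter {
  private let lock = NSLock()
  private var value = 0

  func increment() {
    lock.lock()
    value += 1
    lock.unlock()
  }

  var total: Int {
    lock.lock()
    defer { lock.unlock() }
    return value
  }
}

// Every concurrent iteration adds one, like each GPU thread adding to vertexTotal.
func countWork(items: Int) -> Int {
  let counter = LockedCounter()
  DispatchQueue.concurrentPerform(iterations: items) { _ in
    counter.increment()
  }
  return counter.total
}

print(countWork(items: 10_000))  // 10000
```

Without the lock, concurrent increments could overwrite each other and the total would be unreliable, which is the same reason a plain vertexTotal++ in the kernel cannot be trusted.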

Key Points

GPU compute, or general purpose GPU programming, helps you perform data operations in parallel without using the more specialized rendering pipeline.

You can move any task that operates on multiple items independently to the GPU. Later, you’ll see that you can even move the repetitive task of rendering a scene to a compute shader.

The GPU is good at simple parallel operations, and with Apple silicon, you can keep the data for chained operations in tile memory instead of moving it back to system memory.

Compute processing uses a compute pipeline with a kernel function.

The kernel function operates on a grid of threads organized into threadgroups. This grid can be 1D, 2D or 3D.

Atomic functions let multiple threads safely update the same memory, such as a shared counter.
