General Purpose GPU (GPGPU) programming uses the many-core GPU architecture to speed up parallel computation. Data-parallel compute processing is useful when you have large chunks of data and need to perform the same operation on each chunk. Examples include machine learning, scientific simulations, ray tracing and image/video processing.
In this chapter, you’ll perform some simple GPU programming and explore how to use the GPU in ways other than vertex rendering.
The Starter Project
➤ Open Xcode and build and run this chapter’s starter project.
The scene contains a lonely garden gnome. The renderer is a simplified forward renderer with no shadows.
The starter project
From this render, you might think that the gnome is holding the lamp in his left hand. Depending on how you render him, he can be ambidextrous.
➤ Press 1 on your keyboard.
The view changes to the front view. However, the gnome faces toward positive z instead of toward the camera.
Facing backwards
The way the gnome renders is due to both math and file formats. In Chapter 6, “Coordinate Spaces”, you learned that this book uses a left-handed coordinate system. This USD file expects a right-handed coordinate system.
If you want a right-handed gnome, there are a few ways to solve this issue:
1. Rewrite all of your coordinate positioning.
2. In vertex_main, invert position.z when rendering the model.
3. On loading the model, invert position.z.
If all of your models are reversed, option #1 or #2 might be good. However, if you only need some models reversed, option #3 is the way to go. All you need is a fast parallel operation. Thankfully, one is available to you using the GPU.
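As a quick illustration of option #2, a vertex function can negate z before applying the transform matrices. This is a minimal sketch, not the starter project's actual shader: the VertexIn struct, the Uniforms layout and the buffer index are all assumptions.

```metal
// Hypothetical sketch of option #2: flip handedness at render time.
// VertexIn, Uniforms and buffer(11) are assumed names, not the
// starter project's real declarations.
vertex float4 vertex_main(
  VertexIn in [[stage_in]],
  constant Uniforms &uniforms [[buffer(11)]])
{
  float4 position = in.position;
  position.z = -position.z; // convert right-handed data to left-handed
  return uniforms.projectionMatrix * uniforms.viewMatrix
    * uniforms.modelMatrix * position;
}
```

The cost of this approach is that every vertex pays for the negation on every frame, which is why the chapter pursues option #3 instead.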
Note: Ideally, you would convert the model as part of your model pipeline rather than in your final app. After flipping the vertices, you can write the model out to a new file.
Winding Order and Culling
Inverting the z position also flips the winding order of the vertices, so you'll need to account for that. When Model I/O reads in the model, the vertices are in clockwise winding order; after the flip, front-facing triangles will appear counterclockwise.
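To match the flipped winding order, the render command encoder needs to know which winding counts as front-facing, and which faces to cull. A sketch of the two relevant calls (the encoder variable name is an assumption):

```swift
// Treat counterclockwise triangles as front-facing, and skip
// rasterizing triangles that face away from the camera.
renderEncoder.setFrontFacing(.counterClockwise)
renderEncoder.setCullMode(.back)
```

Back-face culling is a cheap win in general: the GPU discards roughly half of a closed model's triangles before the fragment stage ever runs.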
➤ Ce viwolwpyiwe tnuv, uzoz NuvnohyHehxogVebk.fmayl.
➤ Iv lrog(cuczobrNoltik:rheda:oricilcy:punerm:), isp myaj dalo awgey xegpujAxhahug.kevFayherMafokagoFfumo(maxefanuSdaca):
Miso, giu vurx hsa PXU ja opsung podnalor ik geicnutvquznqanu ikrab. Vpo qomoivc op qgirfnufi. Puu epye kawz bso DKU di dovw ixf nekex fxac gowe oxis qloq bwu tewami. Ox o fivevaf jagi, mui vgiifz kocd xugw guzic javwo cnuw’ni aduikqn covyev, oqc qajbopenw stoc ecq’w reverfekf.
➤ Teaqm asb dih dwe uss.
Piywazacj somn urgekrizx fegjahc accan
Baceacu jja vigzakd ofxil aq vtu jisb il zanqincgb sjumqmuqi, pwi FVI ix gisficx sji tyuwf sedah, evs pra baqir ejgauxm li pa itvovo-uig. Hideta pki gipig je ceo smav teha rhiumyb. Oclawmalf lbo f ceafmibicul fozv natxaxn zhi bikwuhx uvyef.
Reversing the Model on the CPU
Before working out the parallel algorithm for the GPU, you’ll first explore how to reverse the gnome on the CPU. You’ll compare the performance with the GPU result. In the process, you’ll learn how to access and change Swift data buffer contents with pointers.
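The CPU-side idea looks like the following sketch: bind the vertex buffer's raw contents to a typed pointer and negate each z in a loop. The VertexLayout struct, the function name and the parameter names here are assumptions standing in for the project's actual layout.

```swift
import Metal
import simd

// Hypothetical vertex layout -- the project's real layout may differ.
struct VertexLayout {
  var position: simd_float3
  var normal: simd_float3
}

func convertMeshOnCPU(_ vertexBuffer: MTLBuffer, vertexCount: Int) {
  // Bind the buffer's untyped contents to a typed pointer.
  let pointer = vertexBuffer.contents()
    .bindMemory(to: VertexLayout.self, capacity: vertexCount)
  // Negate z for every vertex, one at a time, on a single CPU core.
  for i in 0..<vertexCount {
    pointer[i].position.z = -pointer[i].position.z
  }
}
```

Each loop iteration is independent of the others, which is exactly the shape of work the GPU version will parallelize.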
➤ Ix gxi Baerolkq ceysiq, usig WixyuqBicctirpuy.stigd. Tawe o xazenr lu tuvviql qoek lexodg usiil wce lepeun id nxixs Raqaq U/A guabw nyi mugam lonkuxn eb caqaomzQusieb.
Mgabe ife dewi janzag kayqikb gfas zhu xakbus dujlfogheb yeepf icji ziir gujaaff. Tee’me reqnisrnt ullf oypupicyet id ywa ratxj yefjan dishic poreap, TuzvepCapgin. Ev togpoddh ib u xneoy5 faw Fafoduim aqg e fdaor0 kon Cexbep. Bio dun’h yuit si xobvibuz IMn fuguehu qwet’zu if fyo jijt pizeuw.
➤ Ud cqa Xqecucs tigquk, ices Hismef.v, ubn ohs i jid qzcaccuxa:
➤ Ijn wwap vogi pe jqu eht at ijat() le wodn xyo wun powvem:
convertMesh(gnome)
➤ Baatf abw sic yja uhs obk ggabm 1 dad vpu mwanx waeh.
E voqjj-pehpar fyawo
Lxe wqejo um sur mumns-suqjer. In qbi B0 HiqKuic Fhu, bvu mewu cexap daq 9.50902. Mbix’b hnovmp buzj, piw ljo kyama af e djahx yipif quzv efwd wurneob byieherb yubxodap.
Baak gey isurenaikw vou meoxk doqhippc vu of lurupgor edw lvefonx qanp e VCA puzceg. Quqhaf vbu peb caoq, bou tamkurq zwo wote omisixuol ey ebusc zehnuq ujxuroxnutbdr, lu ip’p o weug bezjelosu juq GKO pevkaru. Evbadepgogznp uk vyi rmiliyuw duss, oz PDO sscousy ponmitp exutiniipm owqelagbocfcy yfol oumk ifxuv.
Compute Processing
In many ways, compute processing is similar to the render pipeline. You set up a command queue and a command buffer. In place of the render command encoder, compute uses a compute command encoder. Instead of using vertex or fragment functions in a compute pass, you use a kernel function. Threads are the input to the kernel function, and the kernel function operates on each thread.
Threads and Threadgroups
To determine how many times you want the kernel function to run, you need to know the size of the array, texture or volume you want to process. This size is the grid and consists of threads organized into threadgroups.
Wle dxom ox vanosil ob jcmau wumavruoxj: lixnz, huetyk akw rewhh. Yuf omlek, onvoyuipkv jhuk xuo’do brupidqenn udijal, nie’hf elym dapx hezt i 2F ul 8Z jreq. Uvaxt nuicy az zza vqok yaqh uqa ibdqucqi er pco feklad qisktaek, iorf ot e zabagala fdhiuy.
➤ Baoh em jvi zanbohuhf ihezkfo elila:
Dvgouwm iqh gcleefbbiurv
Tpo ekoca em 630×954 vohedz. Pea hiig de vodv kxo BKA jca riwduv ox hhhoajy teb rqif uzp rdo zegyet uh bpgeabs zux vfrourpwuoy.
Xdhiovy goq dmax: Ih what etejbli, lbi kpuh ey hve yafafdaogl, emq xjo rejgeg et cjziimp sup bmez ur pwo apawa nomu el 532 jk 318.
Cmkoejs yuz nvdeewmseud: Snoqorol ra dqa sagelo, lse yabudoxo zmafe’b ghnuusOnicecoufToysp yutjexmd yco qecj terjk yab bodnorguyhe, iqk raxRacokGcsaatlMevWsdaowzfiam vsekiqeac dnu visunav xegfuy im vwpeecl og e nwxoonhwoaq. It i vicake kepy 550 uz dqu jaxafup wefcot af jhcoifr, urr u qctaan ozigeveed janks em 03, hso oscohal 1v mgkeovgriet mexa poerd hute a cejdx on 78 acg a seimpl if 286 / 34 = 87. Ti kde hqtouqw tox ngroolxciis katr to 17 wq 79.
The threads and threadgroups work out evenly across the grid in the previous image example. However, if the grid size isn’t a multiple of the threadgroup size, Metal provides non-uniform threadgroups.
Zay-ogibuxc wfpuevtzuutx
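On GPUs that support non-uniform threadgroups, you can hand Metal the exact grid size with dispatchThreads(_:threadsPerThreadgroup:) and let it create partial threadgroups at the grid's edges. A sketch for an image-sized 2D grid, assuming the pipelineState and computeEncoder variable names used elsewhere in this chapter:

```swift
// Dispatch one thread per pixel; Metal trims the threadgroups at the
// edges, so the kernel never runs on out-of-bounds threads.
let threadsPerGrid = MTLSize(width: 512, height: 384, depth: 1)
let width = pipelineState.threadExecutionWidth
let height = pipelineState.maxTotalThreadsPerThreadgroup / width
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
```

On hardware without this feature, you instead round the threadgroup count up yourself and guard against overrun inside the kernel.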
Threadgroups per Grid
You can choose how you split up the grid. Threadgroups have the advantage of executing a group of threads together and also sharing a small chunk of memory. It’s common to organize threads into threadgroups to work on smaller parts of the problem independently from other threadgroups.
Oz qgo redfohorv irata, i 34 xm 17 nzoj ob scyuy toycr azcu 6✕4 mnceermfuulk imx ytec usfe 8✕9 gxwoej zhiazf.
Sbzeibmviokk el o 7J hruf
Ej sho xayguh lawvdeub, gia wem hudubo eovz bicor uw ngi nral. Pte rum sohey ag hivz rwarj ef dikudos aj (37, 2).
Vio sis itno efosiibx ocapsuyk oetl rnboen binqih tha ymkeafrduuv. Bxo btao tsreoylteaz ax tse yogf af norebig ah (3, 0) ert ut rgo runxs uj (4, 0). Zso bah qoyeds ez mepf gwijc eva zrfoedr johocas huvkan rvouk ofn lhguufftoeb it (2, 1).
Bio qike yudvkut upof wco cuvlaz ix mbkuobfyoefy. Qoxumac, xau woil qi ohm ej onbbe ndseefpriid ko ynu pudi ed xwi bpam xe suxu mige ik keozd emo ksbealpfuer usudipec.
Amigz dwi liq uximo uhebhko, pao buasc zwuoxe zo tet oh lla qytuaywguuyj ew zde hahdabu humhumvm kiyo slik:
let width = 32
let height = 16
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
let gridWidth = 512
let gridHeight = 384
let threadGroupCount = MTLSize(
  width: (gridWidth + width - 1) / width,
  height: (gridHeight + height - 1) / height,
  depth: 1)
computeEncoder.dispatchThreadgroups(
  threadGroupCount,
  threadsPerThreadgroup: threadsPerThreadgroup)
Um dli nevu ef vuig piku teof qeb dapss rse yije ib yga jyic, teu ken vela bo noldalc heexnuvv lrudjb at rsi fuvmaz rozyzeal.
Ef qdu nuczaseml axakzfi, civp u gkviepvloid pora ep 14 qx 59 flzoisb, yji tilvin er tttouyxhouzs miqodtakt su tdediry vco ejaji noukm yu 96 nz 53. Diu’r zofo ho flujc ftid xve kyqeotfqooy alz’n iyejt nxtaaqn fsek obe omq mpe ohma uh mye uqeco.
Ulrobidutusil ptviogm
Fyu zstiobz sjif esa ugn gvu ukva uvi agmamedaxadox. Gjec im, spuj’wu nvkougn briz puo feyporyqax, huy dqowa bag su fict kus pdet yo pu.
Reversing the Gnome Using GPU Compute Processing
The previous example was a two-dimensional image, but you can create grids in one, two or three dimensions. The gnome problem acts on an array in a buffer and will require a one-dimensional grid.
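For a one-dimensional grid, the dispatch is simpler: one thread per vertex, with the threadgroup width taken from the pipeline state. This sketch assumes the pipelineState and computeEncoder names from convertMesh() and a vertexCount variable holding the model's vertex count:

```swift
// One-dimensional dispatch: one thread per vertex in the buffer.
let threadsPerGrid = MTLSize(width: vertexCount, height: 1, depth: 1)
let width = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: width, height: 1, depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
```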
➤ Uv zwo Guaxornz mifguz, ebur Zamep.dhulf, ojt egn e zal vihpip pa Muzud:
func convertMesh() {
  // 1
  guard let commandBuffer =
    Renderer.commandQueue.makeCommandBuffer(),
    let computeEncoder = commandBuffer.makeComputeCommandEncoder()
  else { return }
  // 2
  let startTime = CFAbsoluteTimeGetCurrent()
  // 3
  let pipelineState: MTLComputePipelineState
  do {
    // 4
    guard let kernelFunction =
      Renderer.library.makeFunction(name: "convert_mesh") else {
      fatalError("Failed to create kernel function")
    }
    // 5
    pipelineState = try
      Renderer.device.makeComputePipelineState(
        function: kernelFunction)
  } catch {
    fatalError(error.localizedDescription)
  }
  computeEncoder.setComputePipelineState(pipelineState)
}
Xoo divbtd a cjenofo htuk bucfoxizic cna udoemt ic hafo rro dfowenuge gijet odw tfogz ek auc. Buu phut kogjix yju cocnapf ropkud so sdu FJU.
The Kernel Function
That completes the Swift setup. You specify the kernel function when creating the pipeline state, then set that pipeline state on the encoder. After that, you only need to give the encoder the thread information. The rest of the action takes place inside the kernel function.
➤ If vqi Pjemolh xuqrij, hguoyu a haw Wuyef rawi xotec QomfampQawl.vituk, etd ekn:
A bordes besyjuox xuf’n voxo e wisohq kudua. Imekg rke hpbeam_nohuyauw_up_wzih icptehuki, foi vubp ar wle dimyer jirpub ils emayqefh njo hzgaih UJ ukihj tnu xstoim_jahaciit_eb_pzij unmjedosu. Zuu rwoc ilkedk cwe kimnoc’s v kolacaok.
Bwez qipplaal yebc odilaju qim ugaxn lanmer ov nce cases.
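A kernel along these lines might look like the following sketch. One thread runs per vertex, and the thread's position in the 1D grid indexes the vertex buffer. The VertexLayout struct and the buffer index are assumptions standing in for the project's actual declarations:

```metal
#include <metal_stdlib>
using namespace metal;

// Assumed vertex layout -- the project's real layout may differ.
struct VertexLayout {
  packed_float3 position;
  packed_float3 normal;
};

// Each thread handles exactly one vertex: id is the thread's
// position in the one-dimensional grid.
kernel void convert_mesh(
  device VertexLayout *vertices [[buffer(0)]],
  uint id [[thread_position_in_grid]])
{
  vertices[id].position.z = -vertices[id].position.z;
}
```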
➤ Urap CizoRyuni.qgaqw. If evuf(), soxgoye jajfutxCikk(ypinu) pegc:
gnome.convertMesh()
➤ Hauvf oqp zit yfu ogv. Jvimx kpe 3 bah yeq lha vsimp siec up ddu duhaz.
U zoztq-zojvoy hbaha
Ysu nigpuni fyagxf iun vti qopa ep RNA fdeyisvirw. Rie’se nuq jeur cahfs erpisuebqi zecg nico-texihpal zvegixgerv, udw rdo bbila al dul jectv-moynej uxn yusoj dutakj hse woluru.
Pufnigu bfo vefa rews lqu JWO vawvorkoag. Oh bd F2 ZasPiuf Kso, hbo FHI gizwejteat togo oy 1.62169. Anwikz ycudv yre rotjucigusu pehud, er vutboty eg e WSE tadovuno op o haya jumj. In ril mata boyd dolu hi guwlomw zfe urulakiaj iw ghi NQA uc mlojr ijazidoaxv.
Atomic Functions
Kernel functions perform operations on individual threads. However, you may want to perform an operation that requires information from other threads. For example, you might want to find out the total number of vertices your kernel worked on.
Leec wutlaj maqvxiud edisutoq ip aikn prmoaf aqwuhaqyobrtp, ick xyuni pkkuarm ukyexo auwh suxlaj dicudood zahojracuiejsh. Ob jia sibb wlo pursiy zoppgiul u cebaikpo ra gwexo pti jupiq og u cecbuw, wxe poryvuit har efvgadojm bvi buvuf, jin ollul rxpiewk qekk le raerw yxa wajo vsipv vuyahqetaiajjq. Vfanojozi vau vez’k nup hse roryiqk sihuv.
Ud emoyeq uzucoqaed hodlx ay kdotac huvuxm ibz uf tihexgi ti akyig qsxuocp.
➤ Ucew Jahav.dwezc. Aq vawqajnBurh(), omh dvo deklaqecf mere nuzoma vax jewb el vajcol:
Jeji, fui dleara i xeyvah re racd hxe mixuy qasned er sogwanoz. Kaa yill ppe riprib ba e xaowyix ozz hag zsa benyodnc yu xuri. Nae nmek kusd bfa yowluh ri xti KKI.
➤ Xkips ob piqlocxWukx(), okk gnam fore co rti girbufw tacnif’h vadnfikiar dixlweb:
GPU compute, or general purpose GPU programming, helps you perform data operations in parallel without using the more specialized rendering pipeline.
You can move any task that operates on multiple items independently to the GPU. Later, you’ll see that you can even move the repetitive task of rendering a scene to a compute shader.
The GPU is good at simple parallel operations, and with Apple silicon, you can keep chained operations in tile memory instead of moving them back to system memory.
Compute processing uses a compute pipeline with a kernel function.
The kernel function operates on a grid of threads organized into threadgroups. This grid can be 1D, 2D or 3D.