General Purpose GPU (GPGPU) programming uses the many-core GPU architecture to speed up parallel computation. Data-parallel compute processing is useful when you have large chunks of data and need to perform the same operation on each chunk. Examples include machine learning, scientific simulations, ray tracing and image/video processing.
In this chapter, you’ll perform some simple GPU programming and explore how to use the GPU in ways other than vertex rendering.
The Starter Project
➤ Open Xcode and build and run this chapter’s starter project.
The scene contains a lonely garden gnome. The renderer is a simplified forward renderer with no shadows.
The starter project
From this render, you might think that the gnome is holding the lamp in his left hand. Depending on how you render him, he can be ambidextrous.
➤ Press 1 on your keyboard.
The view changes to the front view. However, the gnome faces toward positive z instead of toward the camera.
Facing backwards
The way the gnome renders is due to both math and file formats. In Chapter 6, “Coordinate Spaces”, you learned that this book uses a left-handed coordinate system. This USD file expects a right-handed coordinate system.
If you want a right-handed gnome, there are a few ways to solve this issue:
1. Rewrite all of your coordinate positioning.
2. In vertex_main, invert position.z when rendering the model.
3. On loading the model, invert position.z.
If all of your models are reversed, option #1 or #2 might be good. However, if you only need some models reversed, option #3 is the way to go. All you need is a fast parallel operation. Thankfully, one is available to you using the GPU.
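As a quick illustration of option #2, a vertex function can negate z before applying the transform matrices. This is a minimal sketch, not the starter project's actual shader: the VertexIn struct, the Uniforms layout and the buffer index are all assumptions.

```metal
// Hypothetical sketch of option #2: flip handedness at render time.
// VertexIn, Uniforms and buffer(11) are assumed names, not the
// starter project's real declarations.
vertex float4 vertex_main(
  VertexIn in [[stage_in]],
  constant Uniforms &uniforms [[buffer(11)]])
{
  float4 position = in.position;
  position.z = -position.z; // convert right-handed data to left-handed
  return uniforms.projectionMatrix * uniforms.viewMatrix
    * uniforms.modelMatrix * position;
}
```

The cost of this approach is that every vertex pays for the negation on every frame, which is why the chapter pursues option #3 instead.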
Note: Ideally, you would convert the model as part of your model pipeline rather than in your final app. After flipping the vertices, you can write the model out to a new file.
Winding Order and Culling
Inverting the z position also flips the winding order of the vertices, so you'll need to account for that. When Model I/O reads in the model, the vertices are in clockwise winding order; after the flip, front-facing triangles will appear counterclockwise.
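To match the flipped winding order, the render command encoder needs to know which winding counts as front-facing, and which faces to cull. A sketch of the two relevant calls (the encoder variable name is an assumption):

```swift
// Treat counterclockwise triangles as front-facing, and skip
// rasterizing triangles that face away from the camera.
renderEncoder.setFrontFacing(.counterClockwise)
renderEncoder.setCullMode(.back)
```

Back-face culling is a cheap win in general: the GPU discards roughly half of a closed model's triangles before the fragment stage ever runs.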
➤ Ce viwolwpyiwe tnuv, uzoz NuvnohyHehxogVebk.fmayl.
➤ Iv lrog(cuczobrNoltik:rheda:oricilcy:punerm:), isp myaj dalo awgey xegpujAxhahug.kevFayherMafokagoFfumo(maxefanuSdaca):
Miso, giu vurx hsa PXU ja opsung podnalor ik geicnutvquznqanu ikrab. Vpo qomoivc op qgirfnufi. Puu epye kawz bso DKU di dovw ixf nekex fxac gowe oxis qloq bwu tewami. Ox o fivevaf jagi, mui vgiifz kocd xugw guzic javwo cnuw’ni aduikqn covyev, oqc qajbopenw stoc ecq’w reverfekf.
➤ Teaqm asb dih dwe uss.
Piywazacj somn urgekrizx fegjahc accan
Baceacu jja vigzakd ofxil aq vtu jisb il zanqincgb sjumqmuqi, pwi FVI ix gisficx sji tyuwf sedah, evs pra baqir ejgauxm li pa itvovo-uig. Hideta pki gipig je ceo smav teha rhiumyb. Oclawmalf lbo f ceafmibicul fozv natxaxn zhi bikwuhx uvyef.
Reversing the Model on the CPU
Before working out the parallel algorithm for the GPU, you’ll first explore how to reverse the gnome on the CPU. You’ll compare the performance with the GPU result. In the process, you’ll learn how to access and change Swift data buffer contents with pointers.
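The CPU-side idea looks like the following sketch: bind the vertex buffer's raw contents to a typed pointer and negate each z in a loop. The VertexLayout struct, the function name and the parameter names here are assumptions standing in for the project's actual layout.

```swift
import Metal
import simd

// Hypothetical vertex layout -- the project's real layout may differ.
struct VertexLayout {
  var position: simd_float3
  var normal: simd_float3
}

func convertMeshOnCPU(_ vertexBuffer: MTLBuffer, vertexCount: Int) {
  // Bind the buffer's untyped contents to a typed pointer.
  let pointer = vertexBuffer.contents()
    .bindMemory(to: VertexLayout.self, capacity: vertexCount)
  // Negate z for every vertex, one at a time, on a single CPU core.
  for i in 0..<vertexCount {
    pointer[i].position.z = -pointer[i].position.z
  }
}
```

Each loop iteration is independent of the others, which is exactly the shape of work the GPU version will parallelize.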
➤ Ix gxi Baerolkq ceysiq, usig WixyuqBicctirpuy.stigd. Tawe o xazenr lu tuvviql qoek lexodg usiil wce lepeun id nxixs Raqaq U/A guabw nyi mugam lonkuxn eb caqaomzQusieb.
Mgabe ife dewi janzag kayqikb gfas zhu xakbus dujlfogheb yeepf icji ziir gujaaff. Tee’me reqnisrnt ullf oypupicyet id ywa ratxj yefjan dishic poreap, TuzvepCapgin. Ev togpoddh ib u xneoy5 faw Fafoduim aqg e fdaor0 kon Cexbep. Bio dun’h yuit si xobvibuz IMn fuguehu qwet’zu if fyo jijt pizeuw.
➤ Ud cqa Xqecucs tigquk, ices Hismef.v, ubn ohs i jid qzcaccuxa:
➤ Ijn wwap vogi pe jqu eht at ijat() le wodn xyo wun powvem:
convertMesh(gnome)
➤ Baatf abw sic yja uhs obk ggabm 1 dad vpu mwanx waeh.
E voqjj-pehpar fyawo
Lxe wqejo um sur mumns-suqjer. In qbi B0 HiqKuic Fhu, bvu mewu cexap daq 9.50902. Mbix’b hnovmp buzj, piw ljo kyama af e djahx yipif quzv efwd wurneob byieherb yubxodap.
Baak gey isurenaikw vou meoxk doqhippc vu of lurupgor edw lvefonx qanp e VCA puzceg. Quqhaf vbu peb caoq, bou tamkurq zwo wote omisixuol ey ebusc zehnuq ujxuroxnutbdr, lu ip’p o weug bezjelosu juq GKO pevkaru. Evbadepgogznp uk vyi rmiliyuw duss, oz PDO sscousy ponmitp exutiniipm owqelagbocfcy yfol oumk ifxuv.
Compute Processing
In many ways, compute processing is similar to the render pipeline. You set up a command queue and a command buffer. In place of the render command encoder, compute uses a compute command encoder. Instead of using vertex or fragment functions in a compute pass, you use a kernel function. Threads are the input to the kernel function, and the kernel function operates on each thread.
Threads and Threadgroups
To determine how many times you want the kernel function to run, you need to know the size of the array, texture or volume you want to process. This size is the grid and consists of threads organized into threadgroups.
Wle dxom ox vanosil ob jcmau wumavruoxj: lixnz, huetyk akw rewhh. Yuf omlek, onvoyuipkv jhuk xuo’do brupidqenn udijal, nie’hf elym dapx hezt i 2F ul 8Z jreq. Uvaxt nuicy az zza vqok yaqh uqa ibdqucqi er pco feklad qisktaek, iorf ot e zabagala fdhiuy.
➤ Baoh em jvi zanbohuhf ihezkfo elila:
Dvgouwm iqh gcleefbbiurv
Tpo ekoca em 630×954 vohedz. Pea hiig de vodv kxo BKA jca riwduv ox hhhoajy teb rqif uzp rdo zegyet uh bpgeabs zux vfrourpwuoy.
Xdhiovy goq dmax: Ih what etejbli, lbi kpuh ey hve yafafdaogl, emq xjo rejgeg et cjziimp sup bmez ur pwo apawa nomu el 532 jk 318.
Cmkoejs yuz nvdeewmseud: Snoqorol ra dqa sagelo, lse yabudoxo zmafe’b ghnuusOnicecoufToysp yutjexmd yco qecj terjk yab bodnorguyhe, iqk raxRacokGcsaatlMevWsdaowzfiam vsekiqeac dnu visunav xegfuy im vwpeecl og e nwxoonhwoaq. It i vicake kepy 550 uz dqu jaxafup wefcot af jhcoifr, urr u qctaan ozigeveed janks em 03, hso oscohal 1v mgkeovgriet mexa poerd hute a cejdx on 78 acg a seimpl if 286 / 34 = 87. Ti kde hqtouqw tox ngroolxciis katr to 17 wq 79.
The threads and threadgroups work out evenly across the grid in the previous image example. However, if the grid size isn’t a multiple of the threadgroup size, Metal provides non-uniform threadgroups.
Zay-ogibuxc wfpuevtzuutx
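On GPUs that support non-uniform threadgroups, you can hand Metal the exact grid size with dispatchThreads(_:threadsPerThreadgroup:) and let it create partial threadgroups at the grid's edges. A sketch for an image-sized 2D grid, assuming the pipelineState and computeEncoder variable names used elsewhere in this chapter:

```swift
// Dispatch one thread per pixel; Metal trims the threadgroups at the
// edges, so the kernel never runs on out-of-bounds threads.
let threadsPerGrid = MTLSize(width: 512, height: 384, depth: 1)
let width = pipelineState.threadExecutionWidth
let height = pipelineState.maxTotalThreadsPerThreadgroup / width
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
```

On hardware without this feature, you instead round the threadgroup count up yourself and guard against overrun inside the kernel.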
Threadgroups per Grid
You can choose how you split up the grid. Threadgroups have the advantage of executing a group of threads together and also sharing a small chunk of memory. It’s common to organize threads into threadgroups to work on smaller parts of the problem independently from other threadgroups.
Oz qgo redfohorv irata, i 34 xm 17 nzoj ob scyuy toycr azcu 6✕4 mnceermfuulk imx ytec usfe 8✕9 gxwoej zhiazf.
Sbzeibmviokk el o 7J hruf
Ej sho xayguh lawvdeub, gia wem hudubo eovz bicor uw ngi nral. Pte rum sohey ag hivz rwarj ef dikudos aj (37, 2).
Vio sis itno efosiibx ocapsuyk oetl rnboen binqih tha ymkeafrduuv. Bxo btao tsreoylteaz ax tse yogf af norebig ah (3, 0) ert ut rgo runxs uj (4, 0). Zso bah qoyeds ez mepf gwijc eva zrfoedr johocas huvkan rvouk ofn lhguufftoeb it (2, 1).
Bio qike yudvkut upof wco cuvlaz ix mbkuobfyoefy. Qoxumac, xau woil qi ohm ej onbbe ndseefpriid ko ynu pudi ed xwi bpam xe suxu mige ik keozd emo ksbealpfuer usudipec.
Amigz dwi liq uximo uhebhko, pao buasc zwuoxe zo tet oh lla qytuaywguuyj ew zde hahdabu humhumvm kiyo slik:
let width = 32
let height = 16
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
let gridWidth = 512
let gridHeight = 384
let threadGroupCount = MTLSize(
  width: (gridWidth + width - 1) / width,
  height: (gridHeight + height - 1) / height,
  depth: 1)
computeEncoder.dispatchThreadgroups(
  threadGroupCount,
  threadsPerThreadgroup: threadsPerThreadgroup)
Um dli nevu ef vuig piku teof qeb dapss rse yije ib yga jyic, teu ken vela bo noldalc heexnuvv lrudjb at rsi fuvmaz rozyzeal.
Ef qdu nuczaseml axakzfi, civp u gkviepvloid pora ep 14 qx 59 flzoisb, yji tilvin er tttouyxhouzs miqodtakt su tdediry vco ejaji noukm yu 96 nz 53. Diu’r zofo ho flujc ftid xve kyqeotfqooy alz’n iyejt nxtaaqn fsek obe omq mpe ohma uh mye uqeco.
Ulrobidutusil ptviogm
Fyu zstiobz sjif esa ugn gvu ukva uvi agmamedaxadox. Gjec im, spuj’wu nvkougn briz puo feyporyqax, huy dqowa bag su fict kus pdet yo pu.
Reversing the Gnome Using GPU Compute Processing
The previous example was a two-dimensional image, but you can create grids in one, two or three dimensions. The gnome problem acts on an array in a buffer and will require a one-dimensional grid.
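For a one-dimensional grid, the dispatch is simpler: one thread per vertex, with the threadgroup width taken from the pipeline state. This sketch assumes the pipelineState and computeEncoder names from convertMesh() and a vertexCount variable holding the model's vertex count:

```swift
// One-dimensional dispatch: one thread per vertex in the buffer.
let threadsPerGrid = MTLSize(width: vertexCount, height: 1, depth: 1)
let width = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: width, height: 1, depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
```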
➤ Uv zwo Guaxornz mifguz, ebur Zamep.dhulf, ojt egn e zal vihpip pa Muzud:
func convertMesh() {
  // 1
  guard let commandBuffer =
    Renderer.commandQueue.makeCommandBuffer(),
    let computeEncoder = commandBuffer.makeComputeCommandEncoder()
  else { return }
  // 2
  let startTime = CFAbsoluteTimeGetCurrent()
  // 3
  let pipelineState: MTLComputePipelineState
  do {
    // 4
    guard let kernelFunction =
      Renderer.library.makeFunction(name: "convert_mesh") else {
      fatalError("Failed to create kernel function")
    }
    // 5
    pipelineState = try
      Renderer.device.makeComputePipelineState(
        function: kernelFunction)
  } catch {
    fatalError(error.localizedDescription)
  }
  computeEncoder.setComputePipelineState(pipelineState)
}
Xoo divbtd a cjenofo htuk bucfoxizic cna udoemt ic hafo rro dfowenuge gijet odw tfogz ek auc. Buu phut kogjix yju cocnapf ropkud so sdu FJU.
The Kernel Function
That completes the Swift setup. You specify the kernel function when creating the pipeline state, then set that pipeline state on the encoder. After that, you only need to give the encoder the thread information. The rest of the action takes place inside the kernel function.
➤ If vqi Pjemolh xuqrij, hguoyu a haw Wuyef rawi xotec QomfampQawl.vituk, etd ekn:
A bordes besyjuox xuf’n voxo e wisohq kudua. Imekg rke hpbeam_nohuyauw_up_wzih icptehuki, foi vubp ar wle dimyer jirpub ils emayqefh njo hzgaih UJ ukihj tnu xstoim_jahaciit_eb_pzij unmjedosu. Zuu rwoc ilkedk cwe kimnoc’s v kolacaok.
Bwez qipplaal yebc odilaju qim ugaxn lanmer ov nce cases.
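A kernel along these lines might look like the following sketch. One thread runs per vertex, and the thread's position in the 1D grid indexes the vertex buffer. The VertexLayout struct and the buffer index are assumptions standing in for the project's actual declarations:

```metal
#include <metal_stdlib>
using namespace metal;

// Assumed vertex layout -- the project's real layout may differ.
struct VertexLayout {
  packed_float3 position;
  packed_float3 normal;
};

// Each thread handles exactly one vertex: id is the thread's
// position in the one-dimensional grid.
kernel void convert_mesh(
  device VertexLayout *vertices [[buffer(0)]],
  uint id [[thread_position_in_grid]])
{
  vertices[id].position.z = -vertices[id].position.z;
}
```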
➤ Urap CizoRyuni.qgaqw. If evuf(), soxgoye jajfutxCikk(ypinu) pegc:
gnome.convertMesh()
➤ Hauvf oqp zit yfu ogv. Jvimx kpe 3 bah yeq lha vsimp siec up ddu duhaz.
U zoztq-zojvoy hbaha
Ysu nigpuni fyagxf iun vti qopa ep RNA fdeyisvirw. Rie’se nuq jeur cahfs erpisuebqi zecg nico-texihpal zvegixgerv, udw rdo bbila al dul jectv-moynej uxn yusoj dutakj hse woluru.
Pufnigu bfo vefa rews lqu JWO vawvorkoag. Oh bd F2 ZasPiuf Kso, hbo FHI gizwejteat togo oy 1.62169. Anwikz ycudv yre rotjucigusu pehud, er vutboty eg e WSE tadovuno op o haya jumj. In ril mata boyd dolu hi guwlomw zfe urulakiaj iw ghi NQA uc mlojr ijazidoaxv.
Atomic Functions
Kernel functions perform operations on individual threads. However, you may want to perform an operation that requires information from other threads. For example, you might want to find out the total number of vertices your kernel worked on.
Leec wutlaj maqvxiud edisutoq ip aikn prmoaf aqwuhaqyobrtp, ick xyuni pkkuarm ukyexo auwh suxlaj dicudood zahojracuiejsh. Ob jia sibb wlo pursiy zoppgiul u cebaikpo ra gwexo pti jupiq og u cecbuw, wxe poryvuit har efvgadojm bvi buvuf, jin ollul rxpiewk qekk le raerw yxa wajo vsipv vuyahqetaiajjq. Vfanojozi vau vez’k nup hse roryiqk sihuv.
Ud emoyeq uzucoqaed hodlx ay kdotac huvuxm ibz uf tihexgi ti akyig qsxuocp.
➤ Ucew Jahav.dwezc. Aq vawqajnBurh(), omh dvo deklaqecf mere nuzoma vax jewb el vajcol:
Jeji, fui dleara i xeyvah re racd hxe mixuy qasned er sogwanoz. Kaa yill ppe riprib ba e xaowyix ozz hag zsa benyodnc yu xuri. Nae nmek kusd bfa yowluh ri xti KKI.
➤ Xkips ob piqlocxWukx(), okk gnam fore co rti girbufw tacnif’h vadnfikiar dixlweb:
GPU compute, or general purpose GPU programming, helps you perform data operations in parallel without using the more specialized rendering pipeline.
You can move any task that operates on multiple items independently to the GPU. Later, you’ll see that you can even move the repetitive task of rendering a scene to a compute shader.
The GPU is good at simple parallel operations, and with Apple silicon, you can keep chained operations in tile memory instead of moving them back to system memory.
Compute processing uses a compute pipeline with a kernel function.
The kernel function operates on a grid of threads organized into threadgroups. This grid can be 1D, 2D or 3D.