Metal by Tutorials

24. Performance Optimization
Written by Marius Horga

In the previous chapter, you took a first stab at optimizing your app by profiling your shaders and using Instruments to find even more bottlenecks to get rid of. In this chapter, you’ll look at:

  1. CPU-GPU Synchronization
  2. Multithreading
  3. GPU Families
  4. Memory Management
  5. Best Practices

CPU-GPU synchronization

Always aim to minimize the idle time between frames.

Managing dynamic data can be a little tricky. Take the case of uniforms: you usually change them once per frame on the CPU, which means the GPU has to wait until the CPU has finished writing the buffer before it can read it. Instead, you can keep a pool of reusable buffers.

Triple buffering is a well-known technique in the realm of synchronization. The idea is to use three buffers at a time: while the CPU writes a later buffer in the pool, the GPU reads from an earlier one, preventing them from stepping on each other.

You might ask, why three and not just two or a dozen? With only two buffers, there’s a high risk that the CPU will try to write the first buffer again before the GPU has finished reading it even once. With too many buffers, you waste memory and add latency, risking other performance issues.

Before you implement triple buffering, use Instruments to run a Metal System Trace (MST) session and get a baseline level of CPU activity:

Notice that most tasks peak at about 10% and this is fine, assuming that the GPU has enough work to do on its own without waiting for more work from the CPU.

All right, time to implement that triple buffering pool like a champ!

Open the starter project that comes with this chapter. In Scene.swift, replace this line:

var uniforms = Uniforms()

With this code:

static let buffersInFlight = 3
var uniforms = [Uniforms](repeating: Uniforms(), 
                          count: buffersInFlight)
var currentUniformIndex = 0

Here, you replaced the uniforms variable with an array of three buffers and defined an index to keep track of the current buffer in use.

In update(deltaTime:), replace this code:

uniforms.projectionMatrix = camera.projectionMatrix
uniforms.viewMatrix = camera.viewMatrix

With this:

uniforms[currentUniformIndex].projectionMatrix = camera.projectionMatrix
uniforms[currentUniformIndex].viewMatrix = camera.viewMatrix
currentUniformIndex = 
    (currentUniformIndex + 1) % Scene.buffersInFlight

Here, you adapted the update method to use the new uniforms array and made the index wrap around so that it always takes the values 0, 1 and 2.
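To see the wraparound in isolation, here's a tiny standalone sketch (plain Swift, independent of the project) that cycles an index through a pool of three buffers the same way:

```swift
// Cycle an index through a pool of three buffers, the way
// update(deltaTime:) advances currentUniformIndex each frame.
let buffersInFlight = 3
var currentIndex = 0
var visited: [Int] = []

for _ in 0..<7 {          // simulate seven frames
  visited.append(currentIndex)
  currentIndex = (currentIndex + 1) % buffersInFlight
}

print(visited)            // [0, 1, 2, 0, 1, 2, 0]
```

No matter how many frames you run, the index never leaves the pool's bounds.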

Back in Renderer.swift, add this line to draw(in:), before the renderables loop:

let uniforms = scene.uniforms[scene.currentUniformIndex]

Replace scene.uniforms with uniforms in the two places Xcode complains about.

Build and run the project. It’ll show the same scene as before. Run another MST session and notice that now the CPU activity has increased.

This is both good news and bad news. It’s good news because the GPU is now getting more work to do. The bad news is that the CPU and the GPU will now spar over the same resources.

This is known as resource contention: the CPU and GPU compete for access to shared resources, a conflict called a race condition. Both are trying to read and write the same uniform buffer at the same time, causing unexpected results.

In the image below, the CPU is ready to start writing the third buffer again. However, that would require the GPU to have finished reading it, which is not the case here.

What you need here is a way to delay the CPU from writing a buffer until the GPU has finished reading it.

In Chapter 8, “Character Animation,” you solved this synchronization issue in a naive way by using waitUntilCompleted() on your command buffer. A more performant way, however, is the use of a synchronization primitive called a semaphore, which is a convenient way of keeping count of the available resources — your triple buffer in this case.

Here’s how a semaphore works:

  • Initialize it to a maximum value that represents the number of resources in your pool (three buffers here).
  • Inside the draw call, the thread waits until a resource is available; if one is, it takes it, and the semaphore value decrements by one.
  • If no resources are available, the current thread blocks until the semaphore has at least one resource available.
  • When a thread finishes using the resource, it signals the semaphore, increasing its value and releasing the hold on the resource.
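The steps above can be sketched with DispatchSemaphore on its own, before wiring it into the renderer. In this standalone toy, a serial queue labeled fake-gpu (my own stand-in, not part of the project) plays the GPU's role of returning buffers to the pool:

```swift
import Dispatch

let buffersInFlight = 3
// The semaphore starts at the number of buffers in the pool.
let semaphore = DispatchSemaphore(value: buffersInFlight)
let gpuQueue = DispatchQueue(label: "fake-gpu")
var framesEncoded = 0

for _ in 0..<9 {
  // Wait for a free buffer; this blocks when all three are in flight.
  semaphore.wait()
  framesEncoded += 1          // the "CPU" writes a buffer
  gpuQueue.async {
    // The "GPU" finishes reading and returns the buffer to the pool.
    semaphore.signal()
  }
}
gpuQueue.sync {}              // drain the fake GPU's queue
print(framesEncoded)          // 9
```

The CPU never gets more than three buffers ahead of the GPU, which is exactly the guarantee you want for the uniforms pool.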

Time to put this theory into practice.

At the top of Renderer, add this new property:

var semaphore: DispatchSemaphore

In init(metalView:), add this line before super.init():

semaphore = DispatchSemaphore(value: Scene.buffersInFlight)

Add this line at the top of draw(in:):

_ = semaphore.wait(timeout: .distantFuture)

At the end of draw(in:), but before committing the command buffer, add this:

commandBuffer.addCompletedHandler { _ in
  self.semaphore.signal()
}

This signals the semaphore as soon as the GPU finishes with the command buffer, returning a buffer to the pool.

At the end of draw(in:), remove the naive synchronization from Chapter 8:

commandBuffer.waitUntilCompleted()

Build and run the project again, making sure everything still renders fine as before.

Run another MST session and compare the performance metrics with the previous ones.

If you look at the GFX bar under your specific graphics processor, the gaps are all narrower now because the GPU is not sitting idle as much as before. If you intensify the rendering workload by increasing the number of trees, rocks or grass blades, the gaps might disappear completely. Those “Thread blocked waiting for next drawable” messages are also gone.

Notice an old issue you haven’t fixed yet: most frames still take 33ms, which means your scene runs at only 30 FPS. At this point, there’s no parallelism at work, so it’s time to put your encoders on separate threads.
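As a quick sanity check on the frame-time math, 33ms per frame works out to roughly 30 FPS:

```swift
// Convert a frame time in milliseconds to frames per second.
let frameTime = 33.0 / 1000.0        // seconds per frame
let fps = (1.0 / frameTime).rounded()
print(fps)                           // 30.0
```

To hit 60 FPS, each frame would need to finish in about 16.7ms.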


Multithreading

Build all known pipelines up front and asynchronously. When you want several threads encoding render commands into the same render pass, use a parallel render command encoder:

let commandBuffer = Renderer.commandQueue.makeCommandBuffer()!
let descriptor = MTLRenderPassDescriptor()
let parallelEncoder = commandBuffer.makeParallelRenderCommandEncoder(
                                    descriptor: descriptor)!
let encoder1 = parallelEncoder.makeRenderCommandEncoder()!
// ... encoder1.draw() ...
encoder1.endEncoding()
let encoder2 = parallelEncoder.makeRenderCommandEncoder()!
// ... encoder2.draw() ...
encoder2.endEncoding()
parallelEncoder.endEncoding()

To move encoding work off the main thread, create a concurrent dispatch queue and commit the command buffers from it:

let dispatchQueue = DispatchQueue(label: "Queue", 
                                  attributes: .concurrent)
guard let computeCommandBuffer = 
        Renderer.commandQueue.makeCommandBuffer(),
      let computeEncoder = 
        computeCommandBuffer.makeComputeCommandEncoder() else {
  return
}
// ... encode the compute work ...
computeEncoder.endEncoding()
// 1 - signal the semaphore when the render work completes
commandBuffer.addCompletedHandler { _ in
  self.semaphore.signal()
}
// 2 - commit the render command buffer from a background thread
dispatchQueue.async(execute: commandBuffer.commit)
weak var sem = semaphore
dispatchQueue.async {
  // 3 - signal again when the compute work completes
  computeCommandBuffer.addCompletedHandler { _ in
    sem?.signal()
  }
  computeCommandBuffer.commit()
}
// wait until all blocks submitted to the queue so far have finished
__dispatch_barrier_sync(dispatchQueue) {}

GPU families

GPU families are classes of GPUs categorized by device and/or build target type. They were introduced with the first version of Metal and were originally organized by operating system. At WWDC 2019, Apple repurposed and renamed them. You can query a device’s family at runtime like this:

let devices = MTLCopyAllDevices()
for device in devices {
  if #available(macOS 10.15, *) {
    if device.supportsFamily(.mac2) {
      print("\(device.name) is a Mac 2 family gpu running on macOS Catalina.")
    } else {
      print("\(device.name) is a Mac 1 family gpu running on macOS Catalina.")
    }
  } else {
    if device.supportsFeatureSet(.macOS_GPUFamily2_v1) {
      print("You are using a recent GPU with an older version of macOS.")
    } else {
      print("You are using an older GPU with an older version of macOS.")
    }
  }
}

On a Mac with multiple GPUs, the output looks similar to this:

AMD Radeon RX Vega 64 is a Mac 2 family gpu running on macOS Catalina.
Intel(R) HD Graphics 530 is a Mac 2 family gpu running on macOS Catalina.
AMD Radeon Pro 450 is a Mac 2 family gpu running on macOS Catalina.

Memory management

Whenever you create a buffer or a texture, you should consider how to configure it for fast memory access and driver performance optimizations. Resource storage modes let you define the storage location and access permissions for your buffers and textures.

In shader code, the address space qualifiers play a similar role: use the device address space for larger, per-vertex data you index into, and the constant address space for small data, like uniforms, that many invocations reuse. As a sketch (assuming Vertices has a position and Uniforms an mvpMatrix):

vertex float4 vertex_func(
  const device Vertices *vertices [[buffer(0)]], 
  constant Uniforms &uniforms [[buffer(1)]], 
  uint vid [[vertex_id]]) 
{
  // per-vertex data comes from device memory,
  // shared uniforms from constant memory
  return uniforms.mvpMatrix * vertices[vid].position;
}

Best practices

When you’re squeezing the very last ounce of performance from your app, always remember to follow a golden set of best practices. They fall into three major categories: General Performance, Memory Bandwidth and Memory Footprint.

General performance best practices

The next five best practices are general and apply to the entire pipeline.

  1. Create the off-screen command buffer.
  2. Encode work for the GPU.
  3. Commit the off-screen command buffer.
  4. Get the drawable.
  5. Create the on-screen command buffer.
  6. Encode work for the GPU.
  7. Present the drawable.
  8. Commit the on-screen command buffer.

Memory Bandwidth best practices

Since memory transfers for render targets and textures are costly, the next six best practices target memory bandwidth and show you how to use shared and tiled memory more efficiently.

For render targets and textures that only the GPU touches, use private storage:

textureDescriptor.storageMode = .private 
textureDescriptor.usage = [ .shaderRead, .renderTarget ]
let texture = device.makeTexture(descriptor: textureDescriptor)!

For textures the CPU updates, use shared storage and replace the contents directly:

textureDescriptor.storageMode = .shared 
textureDescriptor.usage = .shaderRead
let texture = device.makeTexture(descriptor: textureDescriptor)!
// update texture data
texture.replace(region: region, mipmapLevel: 0, 
                withBytes: bytes, 
                bytesPerRow: bytesPerRow)

You can also let a blit encoder optimize a texture’s layout for faster GPU access:

let blitCommandEncoder = commandBuffer.makeBlitCommandEncoder()
blitCommandEncoder?.optimizeContentsForGPUAccess(texture: texture)

Finally, avoid storing render target contents you don’t need in a later pass:

renderPassDescriptor.colorAttachments[0].loadAction = .clear 
renderPassDescriptor.colorAttachments[0].storeAction = .dontCare

textureDescriptor.textureType = .type2DMultisample 
textureDescriptor.sampleCount = 4 
textureDescriptor.storageMode = .memoryless
let msaaTexture = 
    device.makeTexture(descriptor: textureDescriptor)
renderPassDesc.colorAttachments[0].texture = msaaTexture 
renderPassDesc.colorAttachments[0].loadAction = .clear 
renderPassDesc.colorAttachments[0].storeAction = .multisampleResolve

Memory Footprint best practices

  1. Use memoryless render targets.
textureDescriptor.storageMode = .memoryless 
textureDescriptor.usage = [ .shaderRead, .renderTarget ]
// for each G-Buffer texture
textureDescriptor.pixelFormat = gBufferPixelFormats[i] 
gBufferTextures[i] = 
    device.makeTexture(descriptor: textureDescriptor)
renderPassDescriptor.colorAttachments[i].texture = gBufferTextures[i]
renderPassDescriptor.colorAttachments[i].loadAction = .clear 
renderPassDescriptor.colorAttachments[i].storeAction = .dontCare

// for each texture in the cache
texturePool[i].setPurgeableState(.volatile)
// later on...
if texturePool[i].setPurgeableState(.nonVolatile) == .empty {
  // regenerate texture
}

Where to go from here?

Getting the last ounce of performance out of your app is paramount. You’ve had a taste of examining CPU and GPU performance using Instruments, but to go further, you’ll want to explore Apple’s Instruments documentation.

Have a technical question? Want to report a bug? You can ask questions and report bugs to the book authors in our official book forum here.
© 2023 Kodeco Inc.
