Metal Rendering Pipeline Tutorial

Take a deep dive through the rendering pipeline and create a Metal app that renders primitives on screen, in this excerpt from our book, Metal by Tutorials! By Marius Horga.


3 – Primitive Assembly

The previous stage sent processed vertices grouped into blocks of data to this stage. The important thing to keep in mind is that vertices belonging to the same geometrical shape (primitive) are always in the same block. That means that the one vertex of a point, the two vertices of a line, or the three vertices of a triangle will always be in the same block, so a second block fetch is never necessary.

Along with vertices, the CPU also sends vertex connectivity information when it issues the draw call command, like this:

renderEncoder.drawIndexedPrimitives(type: .triangle,
                                    indexCount: submesh.indexCount,
                                    indexType: submesh.indexType,
                                    indexBuffer: submesh.indexBuffer.buffer,
                                    indexBufferOffset: 0)

The first argument of the draw function contains the most important information about vertex connectivity. In this case, it tells the GPU that it should draw triangles from the vertex buffer it sent.

The Metal API provides five primitive types:

  • point: Rasterize a point for each vertex. You can set the size of a point by giving a vertex shader output the [[point_size]] attribute.
  • line: Rasterize a line between each pair of vertices. A vertex already used in one line cannot be reused in another, and the last vertex is ignored if the vertex count is odd.
  • lineStrip: Same as line, except the strip connects all adjacent vertices into a poly-line; each vertex (except the first) is connected to the previous one.
  • triangle: Rasterize a triangle for every sequence of three vertices. Trailing vertices that cannot form another triangle are ignored.
  • triangleStrip: Same as triangle, except adjacent vertices are shared, so each new vertex forms a triangle with the previous two.

There is one more primitive type called a patch, but it needs special treatment and cannot be used with the indexed draw call function.
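As a sketch of how the primitive type changes a draw call (assuming `renderEncoder` is an `MTLRenderCommandEncoder` with a pipeline state and vertex buffer already set, and the vertex count of 6 is just for illustration):

```swift
// Rasterize each of the six vertices as a point:
renderEncoder.drawPrimitives(type: .point, vertexStart: 0, vertexCount: 6)

// Connect all six vertices into a single poly-line:
renderEncoder.drawPrimitives(type: .lineStrip, vertexStart: 0, vertexCount: 6)

// Form two separate triangles from the same six vertices:
renderEncoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 6)
```

The only difference between the three calls is the first argument, which is exactly the vertex connectivity information described above.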

The pipeline state specifies the winding order of front-facing vertices. If the winding order is set to counter-clockwise, triangles whose vertices appear in counter-clockwise order are front-faced. All other triangles are back-faced and can be culled, since we cannot see their color and lighting anyway.

Primitives are culled when they are totally occluded by other primitives; when they are only partially off-screen, they are clipped.

For efficiency, you should specify winding order and enable back-face culling.
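A minimal sketch of that setup on the render encoder (assuming `renderEncoder` is your `MTLRenderCommandEncoder`):

```swift
// Declare counter-clockwise vertex order as front-facing,
// then cull everything that faces away from the camera.
renderEncoder.setFrontFacing(.counterClockwise)
renderEncoder.setCullMode(.back)
```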

At this point, primitives are fully assembled from connected vertices and they move on to the rasterizer.

4 – Rasterization

There are two modern rendering techniques currently evolving on separate paths but sometimes used together: ray tracing and rasterization. They are quite different; both have pros and cons.

Ray tracing is preferred when rendering content that is static and far away, while rasterization is preferred when the content is closer to the camera and more dynamic.

With ray tracing, the renderer sends a ray into the scene for each pixel on the screen to see if there's an intersection with an object. If there is, the pixel takes that object's color, but only if the object is closer to the camera than the object previously recorded for that pixel.

Rasterization works the other way around: for each object in the scene, rays are sent from the object back toward the screen to determine which pixels the object covers. Depth information is kept the same way as for ray tracing, so the pixel color is updated only if the current object is closer than the previously saved one.

At this point, all connected vertices sent from the previous stage need to be represented on a two-dimensional grid using their X and Y coordinates. This step is known as triangle setup.

Here is where the rasterizer needs to calculate the slope, or steepness, of the line segment between each pair of vertices. When the slopes of the three edges are known, the triangle can be formed from them.

Next, a process called scan conversion runs on each line of the screen, looking for intersections and determining what is visible and what is not. To draw on the screen at this point, only the vertices and the slopes they determine are needed. The scan algorithm determines whether all the points on a line segment, or all the points inside a triangle, are visible; if so, the triangle is filled with color entirely.

For mobile devices, rasterization takes advantage of the tiled architecture of PowerVR GPUs by rasterizing primitives on a 32×32 tile grid in parallel. Here, 32 is the number of screen pixels assigned to a tile, a size that matches the number of cores in a Unified Shading Cluster (USC).

What if one object is behind another object? How can the rasterizer determine which object to render? This hidden surface removal problem can be solved by using stored depth information (early-Z testing) to determine whether each point is in front of other points in the scene.
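On the API side, you opt into that depth comparison by attaching a depth-stencil state to the encoder. A minimal sketch, assuming `device` is your `MTLDevice`, `renderEncoder` is your encoder, and the render pass has a depth attachment:

```swift
let depthDescriptor = MTLDepthStencilDescriptor()
// Keep a fragment only if it is closer than the depth already stored.
depthDescriptor.depthCompareFunction = .less
depthDescriptor.isDepthWriteEnabled = true

let depthState = device.makeDepthStencilState(descriptor: depthDescriptor)
renderEncoder.setDepthStencilState(depthState)
```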

After rasterization is finished, three more specialized hardware units take the stage:

  • A buffer called Hierarchical-Z is responsible for removing fragments that were marked for culling by the rasterizer.
  • The Z and Stencil Test unit then removes non-visible fragments by comparing them against the depth and stencil buffer.
  • Finally, the Interpolator unit takes the remaining visible fragments and generates fragment attributes from the assembled triangle attributes.

At this point, the Scheduler unit again dispatches work to the shader cores, but this time it’s the rasterized fragments sent for Fragment Processing.

5 – Fragment Processing

Time for a quick review of the pipeline.

  • The Vertex Fetch unit grabs vertices from the memory and passes them to the Scheduler unit.
  • The Scheduler unit knows which shader cores are available so it dispatches work on them.
  • After work is done, the Distributer unit knows if this work was Vertex or Fragment Processing.
  • If it was Vertex Processing work, it sends the result to the Primitive Assembly unit. This path continues to the Rasterization unit and then back to the Scheduler unit.
  • If it was Fragment Processing work, it sends the result to the Color Writing unit.
  • Finally, the colored pixels are sent back to the memory.

The primitive processing in the previous stages was sequential because there is only one Primitive Assembly unit and one Rasterization unit. However, as soon as fragments reach the Scheduler unit, work can be forked (divided) into many tiny parts, and each part is given to an available shader core.

Hundreds or even thousands of cores are now doing parallel processing. When the work is finished, the results will be joined (merged) and sent to the memory, again sequentially.

The fragment processing stage is another programmable stage. You create a fragment shader function that receives the lighting, texture coordinate, depth and color information that the vertex function outputs.

The fragment shader output is a single color for that fragment. Each of these fragments will contribute to the color of the final pixel in the framebuffer. All the attributes are interpolated for each fragment.

For example, to render this triangle, the vertex function would process three vertices with the colors red, green and blue. As the diagram shows, each fragment that makes up this triangle is interpolated from these three colors. Linear interpolation simply averages the color at each point on the line between two endpoints. If one endpoint is red and the other is green, the midpoint of the line between them will be yellow. And so on.

The interpolation equation is parametric, where the parameter p (ranging from 0 to 1) is the fraction of the first color's contribution:

newColor = p * oldColor1 + (1 - p) * oldColor2

Color is easy to visualize, but all the other vertex function outputs are also similarly interpolated for each fragment.
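The same equation in Swift, as a hypothetical helper (the `Color` type and `lerp` function are made up for illustration; on the GPU this happens in fixed-function hardware):

```swift
struct Color { var r, g, b: Float }

// Linear interpolation: p is the fraction of the first color.
func lerp(_ a: Color, _ b: Color, p: Float) -> Color {
  Color(r: p * a.r + (1 - p) * b.r,
        g: p * a.g + (1 - p) * b.g,
        b: p * a.b + (1 - p) * b.b)
}

let red = Color(r: 1, g: 0, b: 0)
let green = Color(r: 0, g: 1, b: 0)
let mid = lerp(red, green, p: 0.5) // (0.5, 0.5, 0): yellow
```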

Note: If you don’t want a vertex output to be interpolated, add the attribute [[flat]] to its definition.

In Shaders.metal, add the fragment function to the end of the file:

fragment float4 fragment_main() {
  return float4(1, 0, 0, 1);
}

This is the simplest fragment function possible: you return the color red in the form of a float4, so all the fragments that make up the cube will be red.

The GPU takes the fragments and does a series of post-processing tests:

  • alpha-testing compares a fragment’s alpha value against a reference value to determine whether the fragment is drawn at all.
  • In the case of translucent objects, alpha-blending combines the color of the new fragment with the color already saved in the color buffer.
  • scissor testing checks whether a fragment is inside a specified rectangle; this test is useful for masked rendering.
  • stencil testing compares the stencil value stored in the framebuffer at the fragment’s position against a reference value you choose.
  • Early-Z testing ran in a previous stage; now late-Z testing runs to resolve remaining visibility issues. Stencil and depth tests are also useful for effects such as ambient occlusion and shadows.
  • Finally, antialiasing is also calculated here so that the final image that reaches the screen does not look jagged.
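For the blending step, Metal exposes the usual source-over equation on the pipeline descriptor. A sketch, assuming `pipelineDescriptor` is the `MTLRenderPipelineDescriptor` you built when setting up the pipeline:

```swift
let attachment = pipelineDescriptor.colorAttachments[0]!
attachment.isBlendingEnabled = true

// result = srcAlpha * srcColor + (1 - srcAlpha) * dstColor
attachment.rgbBlendOperation = .add
attachment.sourceRGBBlendFactor = .sourceAlpha
attachment.destinationRGBBlendFactor = .oneMinusSourceAlpha
attachment.alphaBlendOperation = .add
attachment.sourceAlphaBlendFactor = .sourceAlpha
attachment.destinationAlphaBlendFactor = .oneMinusSourceAlpha
```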