Person Segmentation in the Vision Framework

Learn how to use person segmentation via the Vision framework. By Vidhur Voora.


Computer vision has gained more prominence than ever before. Its applications include cancer detection, cell classification, traffic flow analysis, real-time sports analysis and more. Apple introduced the Vision framework in iOS 11. It lets you perform tasks such as face tracking, barcode detection and image registration. In iOS 15, Apple added an API to the Vision framework for person segmentation, the same capability that powers Portrait mode.

In this tutorial, you’ll learn:

  • What image segmentation is and the different types of segmentation.
  • How to create a person segmentation mask for a photo.
  • The different quality levels and their performance tradeoffs.
  • How to perform person segmentation on live video capture.
  • Other frameworks that provide person segmentation.
  • Best practices for person segmentation.
Note: This tutorial assumes a working knowledge of SwiftUI, UIKit and AVFoundation. For more information about SwiftUI, see SwiftUI: Getting Started. You’ll also need a physical iOS 15 device to follow along.

Getting Started

Download the project by clicking Download Materials at the top or bottom of this page. Open RayGreetings in starter. Build and run on a physical device.

Build and run starter project

You’ll see two tabs: Photo Greeting and Video Greeting. The Photo Greeting tab will show you a nice background image and a family picture. In this tutorial, you’ll use person segmentation to overlay family members on the greeting background. Tap the Video Greeting tab and grant the camera permissions. You’ll see the camera feed displayed. The starter project is set up to capture and display the camera frames. You’ll update the live frames to generate a video greeting!

Before you dive into implementing these, you need to understand what person segmentation is. Get ready for a fun ride.

Introducing Image Segmentation

Image segmentation divides an image into segments and processes them. It gives a more granular understanding of the image. Object detection provides a bounding box of the desired object in an image, whereas image segmentation provides a pixel mask for the object.

There are two types of image segmentation: semantic segmentation and instance segmentation.

Semantic segmentation is the process of detecting and grouping together similar parts of the image that belong to the same class. Instance segmentation is the process of detecting a specific instance of the object. When you apply semantic segmentation to an image with people, it generates one mask that contains all the people. Instance segmentation generates an individual mask for each person in the image.

The person segmentation API provided in Apple’s Vision framework is a single-frame API. It uses semantic segmentation to provide a single mask for all people in a frame. It’s used for both stream and offline processing.

The process of person segmentation has four steps:

  1. Creating a person segmentation request.
  2. Creating a request handler for that request.
  3. Processing the request.
  4. Handling the result.

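The four steps above can be sketched in a few lines of Vision code. The helper below is hypothetical, for orientation only; the rest of this tutorial builds each step out properly inside the project:

```swift
import Vision

// A minimal sketch of the four steps, assuming a CGImage is available.
func personMask(in cgImage: CGImage) -> CVPixelBuffer? {
  // 1. Create a person segmentation request.
  let request = VNGeneratePersonSegmentationRequest()
  // 2. Create a request handler for the image.
  let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
  // 3. Process the request. perform(_:) throws on failure.
  try? handler.perform([request])
  // 4. Handle the result: the first observation's pixelBuffer is the mask.
  return request.results?.first?.pixelBuffer
}
```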
Next, you’ll use the API and these steps to create a photo greeting!

Creating Photo Greeting

You have an image of a family and an image with a festive background. Your goal is to overlay the people in the family picture over the festive background to generate a fun greeting.

In RayGreetings, open GreetingProcessor.swift.

Add the following below import Combine:

import Vision

This imports the Vision framework. Next, add the following to GreetingProcessor below @Published var photoOutput = UIImage():

let request = VNGeneratePersonSegmentationRequest()

Here, you create an instance of the person segmentation request. This is a stateful request and can be reused for an entire sequence of frames. This is especially useful when processing videos offline and for live camera capture.
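The request also exposes a couple of properties you can tune. The values below are a sketch for illustration; quality levels and their tradeoffs are covered in more detail later in the tutorial:

```swift
// Optional tuning, sketched for illustration:
// .fast favors throughput, .accurate favors mask fidelity,
// and .balanced trades between the two.
request.qualityLevel = .balanced
// Explicitly request an 8-bit, one-component mask buffer.
request.outputPixelFormat = kCVPixelFormatType_OneComponent8
```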

Next, add the following to GreetingProcessor:

func generatePhotoGreeting(greeting: Greeting) {
  // 1
  guard
    let backgroundImage = greeting.backgroundImage.cgImage,
    let foregroundImage = greeting.foregroundImage.cgImage else {
    print("Missing required images")
    return
  }
  // 2
  // Create request handler
  let requestHandler = VNImageRequestHandler(
    cgImage: foregroundImage,
    options: [:])
  // TODO
}

Here’s what the code above is doing:

  1. Accesses cgImage from backgroundImage and foregroundImage. Then, it ensures both the images are valid. You’ll be using them soon to blend the images using Core Image.
  2. Creates requestHandler as an instance of VNImageRequestHandler. It takes in an image along with an optional dictionary that specifies how to process the image.

Next, replace // TODO with the following:

do {
  // 1
  try requestHandler.perform([request])
  // 2
  guard let mask = request.results?.first else {
    print("Error generating person segmentation mask")
    return
  }
  // 3
  let foreground = CIImage(cgImage: foregroundImage)
  let maskImage = CIImage(cvPixelBuffer: mask.pixelBuffer)
  let background = CIImage(cgImage: backgroundImage)
  // TODO: Blend images
} catch {
  print("Error processing person segmentation request: \(error)")
}
Here’s a breakdown of the code above:

  1. requestHandler processes the person segmentation request using perform(_:). If multiple requests are present, it returns after all the requests have been either completed or failed. perform(_:) can throw an error while processing the request, so you handle it by enclosing it in a do-catch.
  2. You then retrieve the mask from the results. Because you submitted only one request, you retrieve the first object from the results.
  3. The pixelBuffer property of the returned result has the mask. You then create the CIImage versions of the foreground, background and mask. The CIImage is the representation of an image that the Core Image filter will process. You’ll need this to blend the images.

Blending All the Images

Add the following in GreetingProcessor.swift below import Vision:

import CoreImage.CIFilterBuiltins

Core Image provides methods that give type-safe instances of CIFilter. Here, you import CIFilterBuiltins to access its type-safe APIs.
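To see what "type-safe" buys you, compare the older string-based initializer with the builtins API. This is a quick illustration, not part of the project code:

```swift
import CoreImage
import CoreImage.CIFilterBuiltins

// String-based API: the filter name is a string and parameters are
// set via key-value coding, so a typo only fails at runtime.
let kvcFilter = CIFilter(name: "CIBlendWithMask")
kvcFilter?.setValue(CIImage.empty(), forKey: kCIInputMaskImageKey)

// Type-safe API: the filter and its parameters are real Swift
// properties, checked at compile time.
let typedFilter = CIFilter.blendWithMask()
typedFilter.maskImage = CIImage.empty()
```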

Next, add the following to GreetingProcessor:

func blendImages(
  background: CIImage,
  foreground: CIImage,
  mask: CIImage
) -> CIImage? {
  // 1
  let maskScaleX = foreground.extent.width / mask.extent.width
  let maskScaleY = foreground.extent.height / mask.extent.height
  let maskScaled = mask.transformed(
    by: CGAffineTransform(scaleX: maskScaleX, y: maskScaleY))
  // 2
  let backgroundScaleX = foreground.extent.width / background.extent.width
  let backgroundScaleY = foreground.extent.height / background.extent.height
  let backgroundScaled = background.transformed(
    by: CGAffineTransform(scaleX: backgroundScaleX, y: backgroundScaleY))
  // 3
  let blendFilter = CIFilter.blendWithMask()
  blendFilter.inputImage = foreground
  blendFilter.backgroundImage = backgroundScaled
  blendFilter.maskImage = maskScaled
  // 4
  return blendFilter.outputImage
}
The code above:

  1. Calculates the X and Y scales of the mask with respect to the foreground image, then applies a CGAffineTransform to scale the mask to the foreground image's size.
  2. Like the scaling of mask, it calculates the X and Y scales of background and then scales background to the size of foreground.
  3. Creates blendFilter, which is a Core Image filter. It then sets the inputImage of the filter to the foreground. The backgroundImage and the maskImage of the filter are set to the scaled versions of the image.
  4. outputImage contains the result of the blend.

The returned result is of the type CIImage. You’ll need to convert this to a UIImage to display it in the UI.

In GreetingProcessor, add the following at the top, below let request = VNGeneratePersonSegmentationRequest():

let context = CIContext()

Here, you create an instance of CIContext. It’s used to create a Quartz 2D image from a CIImage object.

Add the following to GreetingProcessor:

private func renderAsUIImage(_ image: CIImage) -> UIImage? {
  guard let cgImage = context.createCGImage(image, from: image.extent) else {
    return nil
  }
  return UIImage(cgImage: cgImage)
}

Here, you use context to create an instance of CGImage from CIImage.

Using cgImage, you then create a UIImage. The user will see that image.
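With both helpers in place, the remaining `// TODO: Blend images` placeholder in generatePhotoGreeting(greeting:) can call them along these lines. This is a sketch of the wiring, using the background, foreground and maskImage constants created earlier in that method:

```swift
// Inside generatePhotoGreeting(greeting:), replacing "// TODO: Blend images".
guard let blended = blendImages(
  background: background,
  foreground: foreground,
  mask: maskImage),
  let greetingImage = renderAsUIImage(blended) else {
  print("Error blending images")
  return
}
// Publish the finished greeting so the UI can display it.
photoOutput = greetingImage
```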