Person Segmentation in the Vision Framework

Learn how to use person segmentation via the Vision framework. By Vidhur Voora.

Computer Vision has gained more prominence than ever before. Its applications include cancer detection, cell classification, traffic flow analysis, real-time sports analysis and many more. Apple introduced the Vision framework as part of iOS 11. It allows you to perform various tasks, such as face tracking, barcode detection and image registration. In iOS 15, Apple introduced an API in the Vision framework to perform person segmentation, which also powers the Portrait mode.

In this tutorial, you’ll learn:

  • What image segmentation is and the different types of segmentation.
  • How to create person segmentation for a photo.
  • The different quality levels and their performance trade-offs.
  • How to create person segmentation for live video capture.
  • Other frameworks that provide person segmentation.
  • Best practices for person segmentation.
Note: This tutorial assumes a working knowledge of SwiftUI, UIKit and AVFoundation. For more information about SwiftUI, see SwiftUI: Getting Started. You’ll also need a physical iOS 15 device to follow along.

Getting Started

Download the project by clicking Download Materials at the top or bottom of this page. Open RayGreetings in the starter folder. Build and run on a physical device.

Build and run starter project

You’ll see two tabs: Photo Greeting and Video Greeting. The Photo Greeting tab will show you a nice background image and a family picture. In this tutorial, you’ll use person segmentation to overlay family members on the greeting background. Tap the Video Greeting tab and grant the camera permissions. You’ll see the camera feed displayed. The starter project is set up to capture and display the camera frames. You’ll update the live frames to generate a video greeting!

Before you dive into implementing these, you need to understand what person segmentation is. Get ready for a fun ride.

Introducing Image Segmentation

Image segmentation divides an image into segments and processes them. It gives a more granular understanding of the image. Object detection provides a bounding box of the desired object in an image, whereas image segmentation provides a pixel mask for the object.

There are two types of image segmentation: semantic segmentation and instance segmentation.

Semantic segmentation is the process of detecting and grouping together similar parts of the image that belong to the same class. Instance segmentation is the process of detecting a specific instance of the object. When you apply semantic segmentation to an image with people, it generates one mask that contains all the people. Instance segmentation generates an individual mask for each person in the image.

The person segmentation API provided in Apple’s Vision framework is a single-frame API. It uses semantic segmentation to provide a single mask for all people in a frame. It can be used for both streaming and offline processing.

The process of person segmentation has four steps:

  1. Creating a person segmentation request.
  2. Creating a request handler for that request.
  3. Processing the request.
  4. Handling the result.
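Before wiring these steps into the app, here’s how they map onto Vision API calls in one compact sketch. This is a minimal stand-alone example, not the project’s code: the blank 64×64 image exists only so the snippet runs on its own; a real photo would produce a meaningful mask.

```swift
import Vision
import CoreGraphics

// Build a small placeholder CGImage so the sketch is self-contained.
let width = 64
let height = 64
let cgContext = CGContext(
  data: nil, width: width, height: height,
  bitsPerComponent: 8, bytesPerRow: 0,
  space: CGColorSpaceCreateDeviceRGB(),
  bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)!
cgContext.setFillColor(red: 1, green: 1, blue: 1, alpha: 1)
cgContext.fill(CGRect(x: 0, y: 0, width: width, height: height))
let image = cgContext.makeImage()!

// 1. Create the person segmentation request.
let request = VNGeneratePersonSegmentationRequest()
// 2. Create a request handler bound to the image.
let handler = VNImageRequestHandler(cgImage: image, options: [:])
// 3. Process the request; perform(_:) can throw.
do {
  try handler.perform([request])
} catch {
  print("Segmentation failed:", error)
}
// 4. Handle the result: a VNPixelBufferObservation holding the mask.
if let mask = request.results?.first {
  print("Mask buffer:", mask.pixelBuffer)
}
```

The tutorial below follows this same four-step shape, first for a photo and later for live video frames.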

Next, you’ll use the API and these steps to create a photo greeting!

Creating a Photo Greeting

You have an image of a family and an image with a festive background. Your goal is to overlay the people in the family picture over the festive background to generate a fun greeting.

Open RayGreetings and open GreetingProcessor.swift.

Add the following below import Combine:

import Vision

This imports the Vision framework. Next, add the following to GreetingProcessor below @Published var photoOutput = UIImage():

let request = VNGeneratePersonSegmentationRequest()

Here, you create an instance of the person segmentation request. This is a stateful request and can be reused for an entire sequence of frames. This is especially useful when processing videos offline and for live camera capture.

Next, add the following to GreetingProcessor:

func generatePhotoGreeting(greeting: Greeting) {
  // 1
  guard 
    let backgroundImage = greeting.backgroundImage.cgImage,
    let foregroundImage = greeting.foregroundImage.cgImage else {
    print("Missing required images")
    return
  }
 
  // 2
  // Create request handler
  let requestHandler = VNImageRequestHandler(
    cgImage: foregroundImage,
    options: [:])
 
  // TODO
}

Here’s what the code above is doing:

  1. Accesses cgImage from backgroundImage and foregroundImage. Then, it ensures both the images are valid. You’ll be using them soon to blend the images using Core Image.
  2. Creates requestHandler as an instance of VNImageRequestHandler. It takes in an image along with an optional dictionary that specifies how to process the image.

Next, replace // TODO with the following:

do {
  // 1
  try requestHandler.perform([request])
 
  // 2
  guard let mask = request.results?.first else {
    print("Error generating person segmentation mask")
    return
  }
 
  // 3
  let foreground = CIImage(cgImage: foregroundImage)
  let maskImage = CIImage(cvPixelBuffer: mask.pixelBuffer)
  let background = CIImage(cgImage: backgroundImage)
 
  // TODO: Blend images
} catch {
  print("Error processing person segmentation request: \(error)")
}

Here’s a breakdown of the code above:

  1. requestHandler processes the person segmentation request using perform(_:). If multiple requests are present, it returns after all the requests have been either completed or failed. perform(_:) can throw an error while processing the request, so you handle it by enclosing it in a do-catch.
  2. You then retrieve the mask from the results. Because you submitted only one request, you retrieve the first object from the results.
  3. The pixelBuffer property of the returned result has the mask. You then create the CIImage versions of the foreground, background and mask. The CIImage is the representation of an image that the Core Image filter will process. You’ll need this to blend the images.

Blending All the Images

Add the following in GreetingProcessor.swift below import Vision:

import CoreImage.CIFilterBuiltins

Core Image provides methods that give type-safe instances of CIFilter. Here, you import CIFilterBuiltins to access its type-safe APIs.

Next, add the following to GreetingProcessor:

func blendImages(
  background: CIImage,
  foreground: CIImage,
  mask: CIImage
) -> CIImage? {
  // 1
  let maskScaleX = foreground.extent.width / mask.extent.width
  let maskScaleY = foreground.extent.height / mask.extent.height
  let maskScaled = mask.transformed(
    by: CGAffineTransform(scaleX: maskScaleX, y: maskScaleY))
 
  // 2
  let backgroundScaleX = foreground.extent.width / background.extent.width
  let backgroundScaleY = foreground.extent.height / background.extent.height
  let backgroundScaled = background.transformed(
    by: CGAffineTransform(scaleX: backgroundScaleX, y: backgroundScaleY))
 
  // 3
  let blendFilter = CIFilter.blendWithMask()
  blendFilter.inputImage = foreground
  blendFilter.backgroundImage = backgroundScaled
  blendFilter.maskImage = maskScaled
 
  // 4
  return blendFilter.outputImage
}

The code above:

  1. Calculates the X and Y scales of the mask with respect to the foreground image. It then applies an affine scale transform to resize the mask to the foreground image.
  2. Like the scaling of the mask, it calculates the X and Y scales of background and then scales background to the size of foreground.
  3. Creates blendFilter, which is a Core Image filter. It then sets the inputImage of the filter to the foreground. The backgroundImage and the maskImage of the filter are set to the scaled versions of the images.
  4. outputImage contains the result of the blend.

The returned result is of the type CIImage. You’ll need to convert this to a UIImage to display it in the UI.

In GreetingProcessor, add the following at the top, below let request = VNGeneratePersonSegmentationRequest():

let context = CIContext()

Here, you create an instance of CIContext. It’s used to create a Quartz 2D image from a CIImage object.

Add the following to GreetingProcessor:

private func renderAsUIImage(_ image: CIImage) -> UIImage? {
  guard let cgImage = context.createCGImage(image, from: image.extent) else {
    return nil
  }
  return UIImage(cgImage: cgImage)
}

Here, you use context to create an instance of CGImage from CIImage.

Using cgImage, you then create a UIImage. The user will see that image.

Displaying the Photo Greeting

Replace // TODO: Blend images in generatePhotoGreeting(greeting:) with the following:

// 1
guard let output = blendImages(
  background: background,
  foreground: foreground,
  mask: maskImage) else {
    print("Error blending images")
    return
  }
 
// 2
if let photoResult = renderAsUIImage(output) {
  self.photoOutput = photoResult
}

Here’s what’s happening:

  1. blendImages(background:foreground:mask:) blends the images and ensures the output isn’t nil.
  2. Then, you convert the output to an instance of a UIImage and set it to photoOutput. photoOutput is a published property. It’s accessed to display the output in PhotoGreetingView.swift.

As a last step, open PhotoGreetingView.swift. Replace // TODO: Generate Photo Greeting in the action closure of Button with the following:

GreetingProcessor.shared.generatePhotoGreeting(greeting: greeting)

Here, you call generatePhotoGreeting(greeting:) to generate the greeting when Button is tapped.

Build and run on a physical device. Tap Generate Photo Greeting.

Photo greeting generated with picture of people put on background

Voila! You’ve now added a custom background to your family pic. It’s time to send that to your friends and family. :]

By default, you get the best quality person segmentation. It does have a high processing cost and might not be suitable for all real-time scenarios. It’s essential to know the different quality and performance options available. You’ll learn that next.

Quality and Performance Options

The person segmentation request you created earlier has a default quality level of VNGeneratePersonSegmentationRequest.QualityLevel.accurate.

You can choose from three quality levels:

  • accurate: Ideal in the scenario where you want to get the highest quality and aren’t constrained by time.
  • balanced: Ideal for processing frames for video.
  • fast: Best suited for processing streaming content.
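In code, you select a level through the request’s qualityLevel property. A minimal sketch (the outputPixelFormat line is an extra option the tutorial doesn’t otherwise use, shown here for completeness):

```swift
import Vision
import CoreVideo

let request = VNGeneratePersonSegmentationRequest()
// The default is .accurate; switch to .balanced or .fast for video
// and streaming workloads.
request.qualityLevel = .balanced
// You can also pick the mask's pixel format; a one-component 8-bit
// buffer is a compact choice for blending.
request.outputPixelFormat = kCVPixelFormatType_OneComponent8
```

Because the request is stateful, you set these properties once and reuse the request across a whole sequence of frames.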

Quality level comparison table

The quality of the generated mask depends on the quality level set.

Mask comparison

Notice that as the quality level increases, the mask captures much finer detail. The accurate level shows more granular details in the mask. The frame size, memory and time to process all vary with the quality level.

Performance comparison table for different quality levels

The frame size for the accurate level is a whopping 64x compared to the fast quality level. The memory and the time to process for an accurate level are much higher when compared to the fast and balanced levels. This represents the trade-off on the quality of the mask and the resources needed to generate it.

Now that you know the trade-off, it’s time to generate a fun video greeting! :]

Creating a Video Greeting

Open CameraViewController.swift. It has all the functionality set up to capture camera frames and render them using Metal. To learn more about setting up a camera with AVFoundation and SwiftUI, check out this tutorial and this video series.

Check out the logic in CameraViewController, which conforms to AVCaptureVideoDataOutputSampleBufferDelegate.

extension CameraViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
  func captureOutput(_ output: AVCaptureOutput,
                     didOutput sampleBuffer: CMSampleBuffer,
                     from connection: AVCaptureConnection) {
    // Grab the pixelbuffer frame from the camera output
    guard let pixelBuffer = sampleBuffer.imageBuffer else {
      return
    }
    self.currentCIImage = CIImage(cvPixelBuffer: pixelBuffer)
  }
}

Here, notice that pixelBuffer is retrieved from sampleBuffer. It’s then rendered by updating currentCIImage. Your goal is to use this pixelBuffer as the foreground image and create a video greeting.

Open GreetingProcessor.swift and add the following to GreetingProcessor:

func processVideoFrame(
  foreground: CVPixelBuffer,
  background: CGImage
) -> CIImage? {
  let ciForeground = CIImage(cvPixelBuffer: foreground)

  // TODO: person segmentation request

  return nil
}

Here, you create an instance of CIImage from the foreground CVPixelBuffer so you can blend the images using Core Image filter.

So far, you’ve used the Vision framework to create, process and handle the person segmentation request. Although it’s easy to use, other frameworks offer similar functionality powered by the same technology. You’ll learn about them next.

Alternatives for Generating Person Segmentation

You can use these frameworks as alternatives to Vision for generating a person segmentation mask:

  • AVFoundation: Can generate a person segmentation mask on certain newer devices when capturing a photo. You can get the mask via the portraitEffectsMatte property of AVCapturePhoto.
  • ARKit: Generates the segmentation mask when processing the camera feed. You can get the mask using the segmentationBuffer property of ARFrame. It’s supported on devices that have A12 Bionic and later.
  • Core Image: Core Image provides a thin wrapper over the Vision framework. It exposes a qualityLevel property that maps to the quality levels of VNGeneratePersonSegmentationRequest.
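As an illustration of the ARKit route, here’s a sketch of reading segmentationBuffer from a session delegate. The delegate class and its setup are hypothetical scaffolding for this example, not part of the RayGreetings project, and the code requires a supported physical device to actually deliver frames.

```swift
import ARKit

// Hypothetical delegate; not part of the sample project.
class SegmentationDelegate: NSObject, ARSessionDelegate {
  func session(_ session: ARSession, didUpdate frame: ARFrame) {
    // segmentationBuffer is only populated on A12 Bionic and later,
    // and only when person segmentation is enabled on the configuration.
    guard let mask = frame.segmentationBuffer else { return }
    print("Mask:", CVPixelBufferGetWidth(mask), "x",
          CVPixelBufferGetHeight(mask))
  }
}

// Enabling person segmentation on a world-tracking configuration.
let configuration = ARWorldTrackingConfiguration()
if ARWorldTrackingConfiguration.supportsFrameSemantics(.personSegmentation) {
  configuration.frameSemantics.insert(.personSegmentation)
}
// A real app would then run an ARSession with this configuration.
```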

Next, you’ll use Core Image to generate a person segmentation mask for the video greeting.

Using Core Image to Generate Person Segmentation Mask

Replace // TODO: person segmentation request in processVideoFrame(foreground:background:) with the following:

// 1
let personSegmentFilter = CIFilter.personSegmentation()
personSegmentFilter.inputImage = ciForeground
personSegmentFilter.qualityLevel = 1
 
// 2
if let mask = personSegmentFilter.outputImage {
  guard let output = blendImages(
    background: CIImage(cgImage: background),
    foreground: ciForeground,
    mask: mask) else {
      print("Error blending images")
      return nil
    }
  return output
}

Here’s what that does:

  1. Creates personSegmentFilter using Core Image’s CIFilter and sets its inputImage to the foreground image. qualityLevel takes a number:
    • 0: Accurate
    • 1: Balanced
    • 2: Fast

    Here, you set qualityLevel to 1, the balanced level.

  2. Fetches the mask from outputImage of personSegmentFilter and ensures it’s not nil. Then, it uses blendImages(background:foreground:mask:) to blend the images and return the result.

Open CameraViewController.swift. Replace the contents of captureOutput(_:didOutput:from:) in CameraViewController extension with the following:

// 1
guard 
  let pixelBuffer = sampleBuffer.imageBuffer,
  let backgroundImage = self.background?.cgImage else {
  return
}
 
// 2
DispatchQueue.global().async {
  if let output = GreetingProcessor.shared.processVideoFrame(
    foreground: pixelBuffer,
    background: backgroundImage) {
    DispatchQueue.main.async {
      self.currentCIImage = output
    }
  }
}

Here’s a breakdown of the code above. It:

  1. Checks that pixelBuffer and backgroundImage are valid.
  2. Processes the video frame asynchronously by calling processVideoFrame(foreground:background:) defined in GreetingProcessor. Then, it updates currentCIImage with the output.

Build and run on a physical device. Tap the Video Greeting tab.

Attempt to generate video greeting

Oh no! There’s no visible camera stream. What happened?

Open GreetingProcessor.swift and put a breakpoint at guard let output = blendImages in processVideoFrame(foreground:background:). Notice the mask generated using Quick Look in the debugger.

View mask using Quick Look debugger

The mask is red! You’ll need to create the blend filter using a red mask instead of the regular white mask.

Update blendImages(background:foreground:mask:) to take a new Boolean parameter as shown below:

func blendImages(
  background: CIImage,
  foreground: CIImage,
  mask: CIImage,
  isRedMask: Bool = false
) -> CIImage? {

This uses isRedMask to determine the type of blend filter to generate. By default, its value is false.

Replace let blendFilter = CIFilter.blendWithMask() in blendImages(background:foreground:mask:isRedMask:) as shown below:

let blendFilter = isRedMask ?
CIFilter.blendWithRedMask() :
CIFilter.blendWithMask()

Here, you generate blendFilter with a red mask if isRedMask is true. Otherwise, you create with a white mask.

Next, replace:

guard let output = blendImages(
  background: CIImage(cgImage: background),
  foreground: ciForeground,
  mask: mask) else { 

in processVideoFrame(foreground:background:) with the following:

guard let output = blendImages(
  background: CIImage(cgImage: background),
  foreground: ciForeground,
  mask: mask,
  isRedMask: true) else {

Here, you tell blendImages(background:foreground:mask:isRedMask:) to generate the blend filter with a red mask.

Build and run on a physical device. Tap Video Greeting and point the front camera toward you.

Video greeting generated with person waving in front of a Happy Thanksgiving background

You now see your image overlaid on a friendly greeting. Great job creating a video greeting!

You can now create a Zoom blur background filter. :]

Understanding Best Practices

While person segmentation worked in photo and video greetings, here are some best practices to keep in mind:

  • Segment a maximum of four people in a scene, and ensure all of them are mostly visible.
  • Each person's height should be at least half the image height.
  • Avoid frames with ambiguous subjects, such as:
    • Statues or other person-like figures
    • People far from the camera

Where to Go From Here?

Download the completed version of the project using the Download Materials button at the top or bottom of this tutorial.

To learn more, check out this WWDC video: Detect people, faces, and poses using Vision

I hope you enjoyed this tutorial. Please join the forum discussion below if you have any questions or comments.