Expert Swift

First Edition · iOS 14 · Swift 5.4 · Xcode 12.5

7. Strings
Written by Ehab Amer

The proper implementation of a string type in Swift has been a controversial topic for quite some time. The design is a delicate balance between Unicode correctness, encoding agnosticism, ease of use and high performance. Almost every major release of Swift has refined the String type to the awesome design we have today. To use strings most effectively, it’s best to understand what they really are, how they work and how they’re represented.

In this chapter, you’ll learn:

  • The binary representation of characters, and how it developed over the years
  • The human representation of a string
  • What a grapheme cluster is
  • How Swift works with UTF encodings, and how low-level details of UTF affect String’s performance
  • Ordering of strings in different locales
  • What string folding is and how you can best search in strings
  • What a substring is and how it relates to memory
  • Custom String interpolation and how you can use it to initialize a custom object from a string or convert it to a string

Binary representations

Character representation has changed so much over the years, starting from ASCII (American Standard Code for Information Interchange), which represents English numbers and characters using up to seven bits.

Then, Extended ASCII came along, which used the remaining 128 values representable by a single byte.

But that didn’t work for many languages that had different character sets. So another standard came out, called ANSI, named after the organization that created it: the American National Standards Institute.

Unlike ASCII, ANSI isn’t a single character set. It’s a family of sets, each able to represent different characters. There are sets for Greek (CP737 and CP869), Hebrew (CP862), Turkish (CP857), Arabic (CP720) and many others. Each of those sets keeps its first 128 values (0 through 127) the same as ASCII, but the remaining 128 values vary from Extended ASCII.

Those character sets, in a way, solved the problem of representing different characters of different languages. But another problem came up! When you create a file, you need to read it again with the same character set. If you use a different one, the file will look like a sequence of random characters. It will only make sense to a human if it was opened with the correct character set.

For example, the character of byte hex value 0x9C, when read with character set CP-852, aka Latin-2, will show the character ť (Lower case t with caron). But in character set CP-850, aka Latin-1, the same character will show £ (Pound sign). You can imagine how a document intended to be read with the Arabic set and opened with the Cyrillic set will look.
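
As a rough illustration of that mismatch, the sketch below decodes one byte with two different single-byte character sets. It assumes Foundation’s .isoLatin1 and .isoLatin2 encodings as stand-ins, because the DOS code pages CP-850 and CP-852 named above aren’t among the built-in String.Encoding cases. The byte value 0xA3 is picked only for this example.

```swift
import Foundation

// One byte, two interpretations. 0xA3 is an arbitrary example value.
let singleByte = Data([0xA3])

// ISO 8859-1 (Latin-1) maps 0xA3 to the pound sign.
let readAsLatin1 = String(data: singleByte, encoding: .isoLatin1)

// ISO 8859-2 (Latin-2) maps the same byte to the letter L with stroke.
let readAsLatin2 = String(data: singleByte, encoding: .isoLatin2)

print(readAsLatin1 ?? "?", readAsLatin2 ?? "?")
```

The bytes on disk never change. Only the table used to read them does, which is why the same file can look correct in one character set and like noise in another.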

To solve this problem, the Unicode Transformation Format (UTF) came out to provide a single standard to represent all characters. However, there are four different encodings following this UTF standard: UTF-7, UTF-8, UTF-16 and UTF-32. Each number is the size in bits of the basic unit that the encoding works with: UTF-7 uses 7-bit units, UTF-32 uses 32-bit (4-byte) units, and so on.

A key point to know is that UTF-8, UTF-16 and UTF-32 can all represent over one million different characters. That’s easy to see for UTF-32, which has 32 bits to work with. UTF-8, despite its name, isn’t limited to 8 bits: a single character can extend to 4 bytes. Covering all possible values in the UTF standard requires 21 bits.

UTF-8 binary representation

Each character in UTF-8 varies in size from 1 byte to 4 bytes. The leading bits of the first byte are reserved to indicate how many bytes the character uses.

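0xxxxxxx   (the only byte of a 1-byte character)
110xxxxx   (first byte of a 2-byte character)
1110xxxx   (first byte of a 3-byte character)
11110xxx   (first byte of a 4-byte character)
10xxxxxx   (continuation byte)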
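
To make the reserved leading bits concrete, here is a minimal sketch that reads the length of a UTF-8 sequence from its first byte. The function name utf8SequenceLength(leadByte:) is made up for this example. It checks only the bit patterns and doesn’t try to reject every invalid byte.

```swift
// Number of bytes in a UTF-8 sequence, judged from the leading bits
// of its first byte. Hypothetical helper, pattern check only.
func utf8SequenceLength(leadByte: UInt8) -> Int? {
  switch leadByte {
  case 0b0000_0000...0b0111_1111: return 1 // 0xxxxxxx
  case 0b1100_0000...0b1101_1111: return 2 // 110xxxxx
  case 0b1110_0000...0b1110_1111: return 3 // 1110xxxx
  case 0b1111_0000...0b1111_0111: return 4 // 11110xxx
  default: return nil // 10xxxxxx is a continuation byte, not a first byte
  }
}

// U+0061, U+00E9, U+20AC and U+1F600 need 1, 2, 3 and 4 bytes.
for sample in ["a", "\u{E9}", "\u{20AC}", "\u{1F600}"] {
  let utf8Bytes = Array(sample.utf8)
  let length = utf8SequenceLength(leadByte: utf8Bytes[0])
  print(sample, utf8Bytes.count, length ?? 0)
}
```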

UTF-16 binary representation

UTF-16 is another variable-length encoding format. A character can be 2 bytes or 4 bytes. Similar to UTF-8, this encoding reserves leading bits to indicate whether those 2 bytes are the whole character or whether the following 2 bytes are also needed.

110110xx xxxxxxxx   (first 2 bytes of a 4-byte character)
110111xx xxxxxxxx   (last 2 bytes of a 4-byte character)
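
The following sketch shows how a value too large for 2 bytes is split across two 16-bit units, and checks the manual arithmetic against Swift’s own utf16 view. The scalar U+1F600 is only an example value.

```swift
let bigScalar: Unicode.Scalar = "\u{1F600}"

// What Swift produces for this scalar in UTF-16.
let utf16Units = Array(String(bigScalar).utf16)

// The same result by hand: subtract 0x10000, then split the remaining
// 20 bits into two groups of 10 and add the reserved prefixes.
let offset = bigScalar.value - 0x10000
let firstUnit = UInt16(0xD800 + (offset >> 10))   // prefix 110110
let secondUnit = UInt16(0xDC00 + (offset & 0x3FF)) // prefix 110111

print(utf16Units == [firstUnit, secondUnit]) // true
```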

UTF-32 binary representation

It’s obvious how UTF-32 works. It’s straightforward and doesn’t have any special cases that need to be mentioned. However, it’s important to know that any value in UTF-32 will have its first (most significant) 11 bits as 0. UTF possible values cover only 21 bits, and those 11 bits are never used.
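
A quick way to check the 11 unused bits is to count the leading zeros of the largest Unicode value, U+10FFFF. This is a small sketch using only standard library integer properties.

```swift
// The largest value Unicode defines.
let largestValue: UInt32 = 0x10FFFF

// Bits that are always zero when the value is stored in 32 bits.
let unusedBits = largestValue.leadingZeroBitCount // 11
let usedBits = UInt32.bitWidth - unusedBits       // 21

// The unicodeScalars view exposes each value as a 32-bit integer.
let scalarValues = "\u{1F600}".unicodeScalars.map { $0.value }
print(unusedBits, usedBits, scalarValues)
```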

Human representation

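Each representable value in a string is named a code point or Unicode scalar. The two names are used almost interchangeably for the numeric representation of a specific character, such as U+0061. Strictly speaking, a Unicode scalar is any code point outside the surrogate range U+D800 to U+DFFF.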
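
As a small sketch, you can list the values behind a string through its unicodeScalars view. The helper codePointLabel(_:) is a name made up for this example. It formats a value in the usual U+ notation.

```swift
// Formats a scalar as "U+" followed by at least four hex digits.
// Hypothetical helper for illustration.
func codePointLabel(_ scalar: Unicode.Scalar) -> String {
  let hex = String(scalar.value, radix: 16, uppercase: true)
  let padding = String(repeating: "0", count: max(0, 4 - hex.count))
  return "U+" + padding + hex
}

for scalar in "a\u{E9}".unicodeScalars {
  print(scalar, codePointLabel(scalar))
}
```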

Grapheme cluster

Knowing how UTF-8 and UTF-16 work to represent variable sizes, you can imagine that knowing the length of a string isn’t as straightforward as it is for ASCII and ANSI representations. For the latter, an array of 100 bytes is simply 100 characters. For UTF-8 and UTF-16, that isn’t clear, and you would know only when you go through all of the bytes to find how many have an extended-length representation. For UTF-32, this isn’t an issue. A string of 40 bytes (320 bits) is a string of 10 characters (including the null terminator at the end).

import Foundation

let eAcute = "\u{E9}"
let combinedEAcute = "\u{65}\u{301}"
eAcute.count // 1
combinedEAcute.count // 1
eAcute == combinedEAcute // true
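
// NSString measures length in UTF-16 code units
let eAcute_objC: NSString = "\u{E9}"
let combinedEAcute_objC: NSString = "\u{65}\u{301}"

eAcute_objC.length // 1
combinedEAcute_objC.length // 2

// NSString equality compares the stored code units literally
eAcute_objC == combinedEAcute_objC // false
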
let acute = "\u{301}"
let smallE = "\u{65}"

acute.count // 1
smallE.count // 1

let combinedEAcute2 = smallE + acute

combinedEAcute2.count // 1
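
To see why the counts differ between String and NSString, it can help to look at the same text through each of String’s views. This is a minimal sketch. The names precomposed and decomposed are chosen for this example.

```swift
let precomposed = "\u{E9}"        // one scalar: e with acute
let decomposed = "\u{65}\u{301}"  // two scalars: e + combining acute

// Characters (grapheme clusters): both are one visible letter.
print(precomposed.count, decomposed.count)                               // 1 1

// Unicode scalars: the decomposed form has two.
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count) // 1 2

// UTF-16 code units: what NSString.length reports.
print(precomposed.utf16.count, decomposed.utf16.count)                   // 1 2

// UTF-8 bytes.
print(precomposed.utf8.count, decomposed.utf8.count)                     // 2 3
```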

UTF in Swift

Until Swift 4.2, Swift used UTF-16 as the preferred encoding. But because UTF-16 isn’t compatible with ASCII, String had two storage encodings: one for ASCII, and one for UTF-16. Swift 5 and later versions use only UTF-8 storage encoding.
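
One way to observe the UTF-8 storage is the standard library’s withUTF8 method, which hands you the string’s bytes as a buffer. The sketch below only counts them. The variable names are chosen for this example.

```swift
var cafe = "Caf\u{E9}"

// withUTF8 is mutating because it may first convert the storage
// to contiguous UTF-8 if it isn't already.
let storedByteCount = cafe.withUTF8 { buffer in buffer.count }

// The utf8 view reports the same number of code units.
let viewByteCount = cafe.utf8.count

print(cafe.count, storedByteCount, viewByteCount) // 4 5 5
```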

Collection protocol conformance

String conforms to the two collection protocols: BidirectionalCollection and RangeReplaceableCollection:

var sampleString = "Lo͞r̉em̗ ȉp͇sum̗ do͞l͙o͞r̉ sȉt̕ a͌m̗et̕"

sampleString.last // t̕

let reversedString = String(sampleString.reversed())
// t̕em̗a͌ t̕ȉs r̉o͞l͙o͞d m̗usp͇ȉ m̗er̉o͞L

if let rangeToReplace = sampleString.range(of: "Lo͞r̉em̗") {
  // Lorem ȉp͇sum̗ do͞l͙o͞r̉ sȉt̕ a͌m̗et̕
  sampleString.replaceSubrange(rangeToReplace,
     with: "Lorem")
}

extension String {
  subscript(position: Int) -> Self.Element {
    get {
      let characters = Array(self)
      return characters[position]
    }
    set(newValue) {
      let startIndex = self.index(self.startIndex,
        offsetBy: position)
      let endIndex = self.index(self.startIndex,
        offsetBy: position + 1)
      let range = startIndex..<endIndex
      replaceSubrange(range, with: [newValue])
    }
  }
}
sampleString[2] // r
sampleString[2] = "R"

sampleString // LoRem ȉp͇sum̗ do͞l͙o͞r̉ sȉt̕ a͌m̗et̕
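
// Slow: every subscript call walks the string again from the start
for i in 0..<sampleString.count {
  sampleString[i].uppercased()
}

// Fast: the iterator visits each character exactly once
for element in sampleString {
  element.uppercased()
}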
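
Without an integer subscript, you reach a position through String.Index. The sketch below shows the standard library calls involved, under the assumption that you want the third character of a short sample string.

```swift
let word = "Lorem ipsum"

// Walk two characters forward from the start. The cost grows
// with the offset because each character may vary in size.
let thirdIndex = word.index(word.startIndex, offsetBy: 2)
let thirdCharacter = word[thirdIndex] // "r"

// limitedBy returns nil instead of trapping when the offset
// runs past the end of the string.
let pastTheEnd = word.index(
  word.startIndex, offsetBy: 50, limitedBy: word.endIndex)

// Reuse an index you already have rather than recomputing an offset.
let rest = word[thirdIndex...] // "rem ipsum"
```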

String ordering

You’re already well acquainted with string comparison. By default, comparing and sorting strings ignores the user’s locale preferences.

let OwithDiaeresis = "Ö"
let zee = "Z"

OwithDiaeresis > zee // true

// German 🇩🇪
OwithDiaeresis.compare(
  zee,
  locale: Locale(identifier: "de_DE")) == .orderedAscending // true

// Sweden 🇸🇪
OwithDiaeresis.compare(
  zee,
  locale: Locale(identifier: "sv_SE")) == .orderedAscending // false

// Character by character: "1" sorts before "2"
"11".localizedCompare("2") == .orderedAscending // true

// Runs of digits compare as numbers: 11 sorts after 2
"11".localizedStandardCompare("2") == .orderedAscending // false

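The same comparison can drive a sort. Here is a minimal sketch that sorts one list with the default ordering and with two locales. The helper sortedWords(_:localeIdentifier:) is a name made up for this example, and the locale-aware results depend on the collation data available on the platform.

```swift
import Foundation

let unsortedWords = ["Z", "\u{D6}", "A"] // "\u{D6}" is Ö

// Default ordering: compares Unicode values, so Ö lands after Z.
let defaultOrder = unsortedWords.sorted()

// Hypothetical helper: sorts with Foundation's locale-aware compare.
func sortedWords(_ words: [String], localeIdentifier: String) -> [String] {
  let locale = Locale(identifier: localeIdentifier)
  return words.sorted {
    $0.compare($1, locale: locale) == .orderedAscending
  }
}

// German treats Ö like O, Swedish places it after Z.
let germanOrder = sortedWords(unsortedWords, localeIdentifier: "de_DE")
let swedishOrder = sortedWords(unsortedWords, localeIdentifier: "sv_SE")

print(defaultOrder, germanOrder, swedishOrder)
```
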
String folding

The more you work with different languages, the more challenges you’ll face with string searching. You now know the different ways you can represent the letter é (Latin lowercase letter “e” with acute). But the word "Café" doesn’t match "Cafe":

"Café" == "Cafe" // false
"Café".contains("e") // false
"Café" == "café" // false
"Café".contains("c") // false
let originalString = "H̾e͜l͘l͘ò W͛òr̠l͘d͐!"
originalString.contains("Hello") // false
let foldedString = originalString.folding(
  options: [.caseInsensitive, .diacriticInsensitive],
  locale: .current)
foldedString.contains("hello") // true
originalString.localizedStandardContains("hello") // true
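
If you search often, one option is to fold both the text and the search term the same way before comparing. The sketch below is one possible shape for that. The names foldedForSearch(_:) and looselyContains(_:_:) are made up for this example, and passing nil as the locale is an assumption that you want the same result for every user.

```swift
import Foundation

// Removes case and diacritic distinctions. A nil locale keeps
// the result independent of the user's settings.
func foldedForSearch(_ text: String) -> String {
  text.folding(
    options: [.caseInsensitive, .diacriticInsensitive],
    locale: nil)
}

// Folds both sides so "Café" and "cafe" can match each other.
func looselyContains(_ haystack: String, _ needle: String) -> Bool {
  foldedForSearch(haystack).contains(foldedForSearch(needle))
}

looselyContains("Caf\u{E9}", "cafe") // true
looselyContains("Caf\u{E9}", "tea")  // false
```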

String and Substring in memory

Another tricky point related to performance in String is Substring. Just as String conforms to StringProtocol, so does Substring.

func doSomething() -> Substring {
  let largeString = "Lorem ipsum dolor sit amet"
  let index = largeString.firstIndex(of: " ") ?? largeString.endIndex
  return largeString[..<index]
}
let subString = doSomething() // Lorem
subString.base // "Lorem ipsum dolor sit amet"
let newString = String(subString)
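
Because both types conform to StringProtocol, a function can accept either one. The sketch below does that and returns a String, so the caller gets its own copy and doesn’t keep the larger base string alive. The function name firstWord(of:) is made up for this example.

```swift
// Accepts String or Substring. Returns an independent String
// so the result doesn't hold on to the original storage.
func firstWord<S: StringProtocol>(of text: S) -> String {
  let end = text.firstIndex(of: " ") ?? text.endIndex
  return String(text[..<end])
}

let sentence = "Lorem ipsum dolor sit amet"
let fromString = firstWord(of: sentence)                 // "Lorem"
let fromSubstring = firstWord(of: sentence.dropFirst(6)) // "ipsum"
```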

Custom string interpolation

String interpolation is a powerful tool for creating strings. But it’s not limited to the creation of strings. Yes, of course, it includes strings, but you can also use it to construct an object through a string. Yes, I know it’s confusing.

struct Book {
  var name: String
  var authors: [String]
  var fpe: String
}
extension Book: ExpressibleByStringLiteral {
  public init(stringLiteral value: String) {
    let parts = value.components(separatedBy: " by: ")
    let bookName = parts.first ?? ""
    let authorNames = parts.last?.components(separatedBy: ",") ?? []
    self.name = bookName
    self.authors = authorNames
    self.fpe = ""
  }
}
var book: Book = """
Expert Swift by: Ehab Amer,Marin Bencevic,\
Ray Fix,Shai Mishali
"""

book.name // Expert Swift
book.authors.first // Ehab Amer
var invalidBook: Book = """
Book name is `Expert Swift`. \
Written by: Ehab Amer, Marin Bencevic, \
Ray Fix & Shai Mishali
"""

invalidBook.name // Book name is `Expert Swift`. Written
invalidBook.authors.last // Ray Fix & Shai Mishali
extension Book: ExpressibleByStringInterpolation { // 1
  struct StringInterpolation: StringInterpolationProtocol { // 2
    var name: String // 3
    var authors: [String]
    var fpe: String

    init(literalCapacity: Int, interpolationCount: Int) { // 4
      name = ""
      authors = []
      fpe = ""
    }

    mutating func appendLiteral(_ literal: String) { // 5
      // Do something with the literals?
    }

    mutating func appendInterpolation(_ name: String) { // 6
      self.name = name
    }

    mutating func appendInterpolation(
      authors list: [String]) { // 7
      authors = list
    }
  }

  init(stringInterpolation: StringInterpolation) { // 8
    self.authors = stringInterpolation.authors
    self.name = stringInterpolation.name
    self.fpe = stringInterpolation.fpe
  }
}
var interpolatedBook: Book = """
The awesome team of authors \(authors:
  ["Ehab Amer", "Marin Bencevic", "Ray Fix", "Shai Mishali"]) \
wrote this great book. Titled \("Expert Swift")
"""
// Roughly what the compiler generates for the literal above.
// The value must be a var because the append methods are mutating.
var stringInterpolation = Book.StringInterpolation(
  literalCapacity: 59,
  interpolationCount: 2)

stringInterpolation.appendLiteral("The awesome team of authors ")

stringInterpolation.appendInterpolation(
  authors: ["Ehab Amer",
            "Marin Bencevic",
            "Ray Fix",
            "Shai Mishali"])

stringInterpolation
  .appendLiteral(" wrote this great book. Titled ")

stringInterpolation
  .appendInterpolation("Expert Swift")

Book(stringInterpolation: stringInterpolation)
extension Book.StringInterpolation {
  mutating func appendInterpolation(fpe name: String) {
    fpe = name
  }
}
var interpolatedBookWithFPE: Book = """
\("Expert Swift") had an amazing \
final pass editor \(fpe: "Eli Ganim")
"""
extension Book.StringInterpolation {
  mutating func appendInterpolation(bookName name: String) {
    self.name = name
  }

  mutating func appendInterpolation(anAuthor name: String) {
    self.authors.append(name)
  }
}
var interpolatedBook2: Book = """
\(anAuthor: "Ray Fix") & \(anAuthor: "Shai Mishali") \
were authors in \(bookName: "Expert Swift")
"""
var num = 1234
var string = "The number is: \(num)"
string = "\(book)"
// Book(name: "Expert Swift", authors: ["Ehab Amer", "Marin Bencevic", "Ray Fix", "Shai Mishali"], fpe: "")
extension String.StringInterpolation {
  mutating func appendInterpolation(_ book: Book) {
    appendLiteral("The Book \"")
    appendLiteral(book.name)
    appendLiteral("\"")

    if !book.authors.isEmpty {
      appendLiteral(" Authored by: ")
      for author in book.authors {
        if author == book.authors.first {
          appendLiteral(author)
        } else {
          if author == book.authors.last {
            appendLiteral(", & ")
            appendLiteral(author)
            appendLiteral(".")
          } else {
            appendLiteral(", ")
            appendLiteral(author)
          }
        }
      }
    }

    if !book.fpe.isEmpty {
      appendLiteral(" Final Pass Edited by: ")
      appendLiteral(book.fpe)
    }
  }
}
interpolatedBook.fpe = "Eli Ganim"
var string2 = "\(interpolatedBook)"
// The Book "Expert Swift" Authored by: Ehab Amer, Marin Bencevic, Ray Fix, & Shai Mishali. Final Pass Edited by: Eli Ganim
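
The same mechanism works for any value, not only Book. As a further sketch, the extension below adds a labeled interpolation that prints an integer in hexadecimal. The hex label and the "0x" prefix are choices made for this example.

```swift
extension String.StringInterpolation {
  // Hypothetical interpolation: "\(hex: 255)" appends "0xFF".
  mutating func appendInterpolation(hex value: Int) {
    appendLiteral("0x" + String(value, radix: 16, uppercase: true))
  }
}

let hexMessage = "The value is \(hex: 255)"
// The value is 0xFF
```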

Key points

  • ASCII was the first standard for storing characters, and it evolved to UTF to represent all the possible characters in one single standard.
  • UTF-8 and UTF-16 can both represent the full 21-bit range of values through variable-length representations. A UTF-8 character can take up to 4 bytes.
  • UTF-16 and UTF-32 aren’t backward compatible with ASCII.
  • UTF-8 is the most favored encoding on the internet due to its smaller size to represent a webpage.
  • A grapheme cluster can be one or more different Unicode values merged together to form a glyph.
  • A character in Swift is a grapheme cluster, not a Unicode value. And the same cluster can be represented in different ways. This is called canonical equivalence.
  • To reach the nth character in a string, you need to traverse the n-1 characters before it. It is not an O(1) operation.
  • The order of strings can vary based on the locale.
  • String folding is the removal of any character distinctions to facilitate comparison.
  • Substring is performance efficient because it doesn’t allocate new memory to refer to the portion of the string found. However, this means that the original string is still present in memory.
  • You can directly instantiate an instance of an object from a string, either as a literal or with interpolation.
  • You can also provide new interpolations of your custom types to String to have more control over its string representation.
© 2024 Kodeco Inc.
