Regular Expressions in Kotlin

Learn how to improve your string manipulation with the power of regular expressions in Kotlin. You’ll love them! By Arjuna Sky Kok.


Captured Groups and Back-references

You captured all superheroes infiltrating Supervillains Club with the first name Captain. Now, they’re ready to convert to supervillains, but supervillains can’t use Captain as their name.

Your task now is to extract the last name from the superheroes. Later, Supervillains Club will give them a first name suitable for a supervillain.

To recap, you have to remove Captain from Captain Marvel, then give Marvel to your employer. Later, your employer will give them a different first name, like Dark Marvel. You only need to extract the last name.

Build and run the app. Open http://localhost:8080/extract then click Extract Names. Nothing happens:

Supervillains Club Extracting Names Form

To solve this problem, you’ll still use findAll. But this time, you’ll use a group in the regex string.

In RegexValidator.kt, replace the content of extractNames with:

val pattern = Regex("""Captain\s(\w+)""")
val results = pattern.findAll(names)
return results.map {
  it.groupValues[1]
}.toList()

This code is almost the same as the previous code, but there are two differences:

  1. (\w+): The regex string now has a group.
  2. groupValues[1]: You use groupValues instead of value.

groupValues[1] refers to the (\w+) group in the regex string. Remember that (\w+) is the last name.

What is the number 1 in groupValues[1] exactly? It’s the index of the first group in the groupValues list.

You don’t use index 0 because it refers to the full match, such as Captain Marvel. But how big could groupValues be? It holds the full match plus one entry for each group in the regex string.
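To see the difference between index 0 and index 1 outside the app, here’s a minimal sketch. The function name extractLastNames and the sample input are made up for illustration; they aren’t part of the tutorial project.

```kotlin
// Hypothetical stand-in for the tutorial's extractNames function.
fun extractLastNames(names: String): List<String> {
    val pattern = Regex("""Captain\s(\w+)""")
    // groupValues[1] is the text captured by (\w+) in each match.
    return pattern.findAll(names).map { it.groupValues[1] }.toList()
}

fun main() {
    // Made-up sample input.
    val names = "Captain Marvel, Iron Man, Captain America"

    val first = Regex("""Captain\s(\w+)""").find(names)!!
    println(first.groupValues[0]) // Captain Marvel (the full match)
    println(first.groupValues[1]) // Marvel (the first group)

    println(extractLastNames(names)) // [Marvel, America]
}
```

Iron Man is skipped because the pattern only matches names that start with Captain.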

Suppose you have three groups in the regex string:

val pattern = Regex("""((Cap)tain)\s(\w+)""")

If the input string is Captain Marvel:

  • Index 0 refers to Captain Marvel.
  • Index 1 refers to Captain.
  • Index 2 refers to Cap.
  • Index 3 refers to Marvel.

Groups are numbered by the position of their opening parenthesis, from left to right. That means an outer group comes before the groups nested inside it. Index 0 always refers to the full match. Index 1 refers to the outermost group on the left, ((Cap)tain).

Then you go inside that group to reach index 2, which refers to (Cap). Finally, you move to the right, and index 3 refers to (\w+).
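If you’d like to check the numbering yourself, this small sketch prints every entry of groupValues for the three-group pattern. The helper name groupsOf is an assumption made for this example.

```kotlin
// Hypothetical helper: returns the full match followed by each captured group.
fun groupsOf(input: String): List<String> {
    val pattern = Regex("""((Cap)tain)\s(\w+)""")
    // find returns null when nothing matches, so fall back to an empty list.
    return pattern.find(input)?.groupValues ?: emptyList()
}

fun main() {
    groupsOf("Captain Marvel").forEachIndexed { index, value ->
        println("$index -> $value")
    }
    // 0 -> Captain Marvel
    // 1 -> Captain
    // 2 -> Cap
    // 3 -> Marvel
}
```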

Build and run the app. Then click Extract Names. You’ll get this result:

Supervillains Club Extracting Names Result

You’ve extracted the last name perfectly. Good job!

You feel proud of your code. It helps supervillains prosper in this wicked world.

But your employer doesn’t have time to pick a custom first name for every superhero willing to become a supervillain. They tell you to use a generic first name, Super Evil, and be done with it. So Captain Marvel will become Super Evil Marvel.

Open http://localhost:8080/replace and click Replace Names. Nothing happens:

Supervillains Club Turning Superheroes into Supervillains

It’s time to convert these superheroes to supervillains!

To replace strings with regex, you use… guess what? replace. :]

Replace the content of replaceNames in RegexValidator.kt with the code below:

val pattern = Regex("""Captain\s(\w+)""")
return pattern.replace(names, "Super Evil $1")

replace accepts two parameters. The first is the string against which you want to match your regex """Captain\s(\w+)""".

The second is the replacement string. It’s Super Evil $1.

The $1 in Super Evil $1 is a special sequence in the replacement string, not literal text. It stands for the first captured group, the same value as groupValues[1] in the previous example. This is a back-reference.

So the back-reference makes a reference to the captured group. The captured group is (\w+) in Captain\s(\w+).

It’s similar to building each replacement yourself, except that replace also keeps the text around the matches:

val pattern = Regex("""Captain\s(\w+)""")
val results = pattern.findAll(names)
return results.map {
  "Super Evil ${it.groupValues[1]}"
}.joinToString()

But it’s much less code!
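As a rough comparison, the sketch below runs the back-reference version next to the overload of replace that takes a lambda. Both function names and the sample input are assumptions for illustration. Note that both versions leave non-matching text, such as Iron Man, untouched.

```kotlin
// Hypothetical helper: replacement using the $1 back-reference.
fun replaceWithBackReference(names: String): String =
    Regex("""Captain\s(\w+)""").replace(names, "Super Evil $1")

// Hypothetical helper: the same replacement, spelled out with a lambda.
fun replaceWithLambda(names: String): String =
    Regex("""Captain\s(\w+)""").replace(names) { match ->
        "Super Evil ${match.groupValues[1]}"
    }

fun main() {
    // Made-up sample input.
    val names = "Captain Marvel, Iron Man, Captain America"

    println(replaceWithBackReference(names))
    // Super Evil Marvel, Iron Man, Super Evil America

    println(replaceWithLambda(names))
    // Super Evil Marvel, Iron Man, Super Evil America
}
```

The lambda form is handy when the replacement needs logic that a back-reference can’t express, such as changing the case of the captured name.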

Build and run the app. Click Replace Names. You’ll see all superheroes who want to repent got a new first name:

Supervillains Club Turning Superheroes into Supervillains Result

Now with these new names, the superheroes have become supervillains officially!

Understanding Greedy Quantifiers, Possessive Quantifiers and Reluctant Quantifiers

Supervillains Club throws you another task. All supervillains have diet plans. The nutritionist in Supervillains Club has made a plan tailored for supervillains.

Open http://localhost:8080/diet and you’ll see a diet plan for supervillains in HTML format:

Supervillains Diet Plan Form

The data scientists ask you to extract the diet plan from the HTML file. In other words, you want to extract an array of the meals from the HTML string: 5kg Unicorn Meat, 2L Lava, 2kg Meteorite.

You need to match strings between the li tags. The strings could be anything. How do you match strings that can be anything?

You use . to represent any character in regex, with one exception: line terminators.

Whether . matches line terminators depends on the configuration of the regex. But you don’t need to worry about this in this tutorial.
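For the curious, here’s a minimal sketch of that configuration using RegexOption.DOT_MATCHES_ALL from the Kotlin standard library. The helper name dotMatchesNewline and the tiny a.b pattern are made up for this example; the tutorial project doesn’t use them.

```kotlin
// Hypothetical helper: checks whether "a.b" matches a string with a newline between a and b.
fun dotMatchesNewline(dotAll: Boolean): Boolean {
    val pattern = if (dotAll) {
        Regex("a.b", RegexOption.DOT_MATCHES_ALL) // . also matches line terminators
    } else {
        Regex("a.b") // default: . stops at line terminators
    }
    return pattern.containsMatchIn("a\nb")
}

fun main() {
    println(dotMatchesNewline(dotAll = false)) // false
    println(dotMatchesNewline(dotAll = true))  // true
}
```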

You know the ?, * and + quantifiers. These are called greedy quantifiers. You’ll know why they’re greedy soon!

What happens if you join . and *? Together, they match any sequence of characters, including an empty one!

Interestingly, you can add the ? or + quantifiers to .*. The quantifiers alter the behavior of .*. You’ll experiment with all of them.

Using Greedy Quantifiers

First, you’ll use the greedy quantifier, .*.

In RegexValidator.kt, replace the content of extractNamesFromHtml with:

val pattern = Regex("""<li>(.*)</li>""")
val results = pattern.findAll(names)
return results.map {
  it.groupValues[1]
}.toList()

Here, you use the method you used previously, findAll. The logic is simple: You use a group to capture the string between the li tags. Then you use groupValues when extracting the string.

Build and run the app, then submit the form. The result isn’t something you expect:

Supervillains Diet Plan Greedy Quantifier Result

You got a one-item list, not a three-item list. The (.*) group swallowed every </li> and <li> in the middle, stopping only before the last </li>.

That’s why people call this quantifier greedy. It first grabs as many characters as possible, then gives characters back one at a time until the rest of the pattern, </li>, can match.
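You can reproduce the greedy behavior without the web form. This sketch assumes the diet plan arrives as a single-line HTML string like the one below; the function name extractGreedy and the exact markup are assumptions for illustration.

```kotlin
// Hypothetical stand-in for extractNamesFromHtml using the greedy pattern.
fun extractGreedy(html: String): List<String> =
    Regex("""<li>(.*)</li>""").findAll(html).map { it.groupValues[1] }.toList()

fun main() {
    // Assumed single-line markup for the diet plan.
    val html = "<ul><li>5kg Unicorn Meat</li><li>2L Lava</li><li>2kg Meteorite</li></ul>"

    val meals = extractGreedy(html)
    println(meals.size) // 1
    println(meals)
    // [5kg Unicorn Meat</li><li>2L Lava</li><li>2kg Meteorite]
}
```

The single item runs from just after the first <li> to just before the last </li>, which is as far as .* can stretch while still letting </li> match.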

But there’s another quantifier that’s greedier than the greedy quantifier: the possessive quantifier.

Using Possessive Quantifiers

Now, replace the content of extractNamesFromHtml with:

val pattern = Regex("""<li>(.*+)</li>""")
val results = pattern.findAll(names)
return results.map {
  it.groupValues[1]
}.toList()

Notice that the difference is you put + on the right of .*. This is a possessive quantifier.

Build and run the app. Then submit the form:

Supervillains Diet Plan Possessive Quantifier Result

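The result is empty. The regex pattern failed to match the string because .*+ in <li>(.*+)</li> consumes everything after the first <li>, including the final </li>: 5kg Unicorn Meat</li><li>2L Lava</li><li>2kg Meteorite</li>. Unlike a greedy quantifier, a possessive quantifier never gives characters back. So by the time the regex pattern moves to </li> in <li>(.*+)</li>, it can’t match the string because there is nothing left to match.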
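Here’s the same experiment as a standalone sketch. It assumes you’re running on the JVM, where Kotlin’s Regex is backed by java.util.regex and supports possessive quantifiers. The function name extractPossessive and the sample markup are made up for illustration.

```kotlin
// Hypothetical stand-in for extractNamesFromHtml using the possessive pattern.
// Assumes Kotlin/JVM; other Kotlin targets may not support possessive quantifiers.
fun extractPossessive(html: String): List<String> =
    Regex("""<li>(.*+)</li>""").findAll(html).map { it.groupValues[1] }.toList()

fun main() {
    // Assumed single-line markup for the diet plan.
    val html = "<ul><li>5kg Unicorn Meat</li><li>2L Lava</li><li>2kg Meteorite</li></ul>"

    // .*+ consumes the rest of the line and never backtracks,
    // so the trailing </li> in the pattern has nothing to match.
    println(extractPossessive(html)) // []
}
```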

What you want is a reluctant quantifier.