Hannikainen's blog

Mutation testing

You’re doing some customer work, and you get a request to implement a function that checks whether a person can drink alcohol at a bar. The answer naturally depends on the age of the person and the country where the bar is. After interrogating the customer a bit more, you discover that the code should give correct answers for Finland, the United States and Germany, where the age limits are 18, 21 and 16 respectively.

You boot up your computer and start drafting a solution with Java:

$ mvn archetype:generate -DgroupId=fi.bytecraft.mutations -DartifactId=mutations -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false

package fi.bytecraft.mutations;

public class App
{
    public static void main(String[] args) {}

    enum CountryCode {
        FI, /* Finland */
        US, /* United States */
        DE /* Germany */
    }

    public static boolean canDrinkAlcohol(int age, CountryCode country) {
        if (country == CountryCode.FI && age <= 17) {
            return false;
        } else if (country == CountryCode.US && age < 20) {
            return false;
        } else if (country == CountryCode.DE && age <= 15) {
            return false;
        }
        return true;
    }
}

Every codebase naturally requires tests. Because the deadline for the feature is specified to be “yesterday”, you quickly guess some test cases and write them without thinking further:

package fi.bytecraft.mutations;

import static org.junit.Assert.assertTrue;
import static org.junit.Assert.assertFalse;

import org.junit.Test;

public class AppTest
{
    @Test
    public void TestAlcoholLegalAges()
    {
        assertFalse(App.canDrinkAlcohol(17, App.CountryCode.FI));
        assertTrue(App.canDrinkAlcohol(18, App.CountryCode.FI));
        assertFalse(App.canDrinkAlcohol(18, App.CountryCode.US));
        assertTrue(App.canDrinkAlcohol(21, App.CountryCode.US));
        assertFalse(App.canDrinkAlcohol(15, App.CountryCode.DE));
        assertTrue(App.canDrinkAlcohol(16, App.CountryCode.DE));
    }
}

The code seems to work:

$ mvn test
[... a lot of text ...]
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 s - in fi.bytecraft.mutations.AppTest
[... a lot of text ...]

You’re not quite sure whether the tests test everything, so you add a code coverage tool to measure the tests:

<plugin>
  <groupId>org.jacoco</groupId>
  <artifactId>jacoco-maven-plugin</artifactId>
  <version>0.8.7</version>
</plugin>

$ mvn jacoco:prepare-agent install jacoco:report
[... a lot of text ...]
$ xdg-open target/site/jacoco/index.html

You take a peek at the report given by the tool. Since every line seems green (which supposedly means tested), you click “Deploy”…

…and about two weeks later the customer calls you angrily and tells you that a minor has been served alcohol [1]. What went wrong?

Mutation testing

Mutation testing means using tools that modify the system under test before running the tests. Each modification (a mutation) should break the code, so that at least one test fails. If all of the tests still pass, the mutation is said to ‘survive’ (which is bad), and if any test fails, the mutation is said to be ‘killed’ (which is good). The tool then reports mutation test coverage, where green lines mean killed mutations and red ones surviving mutations. A typical mutation is changing the < operator into >, after which some test should fail, as the behaviour of the code has changed.

Let’s take the max(a,b) function as an example, implemented as max(a,b) = a > b ? a : b. If the > operator is changed into <, the code no longer does the same thing, so some test should spot that.
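As a minimal sketch of that idea, the class below puts the original max next to a hand-written mutant in which > has been flipped to <. The names MaxMutationDemo, mutatedMax and testsPass are made up for this illustration; a real tool such as pitest mutates the compiled bytecode instead of keeping two copies of the method around.

```java
import java.util.function.IntBinaryOperator;

public class MaxMutationDemo {
    // Original implementation from the text.
    public static int max(int a, int b) {
        return a > b ? a : b;
    }

    // Hand-written mutant (illustration only): '>' flipped to '<'.
    public static int mutatedMax(int a, int b) {
        return a < b ? a : b;
    }

    // A tiny "test suite": true if every check passes for the given implementation.
    public static boolean testsPass(IntBinaryOperator impl) {
        return impl.applyAsInt(1, 2) == 2
            && impl.applyAsInt(2, 1) == 2
            && impl.applyAsInt(3, 3) == 3;
    }

    public static void main(String[] args) {
        // The original passes the suite; the mutant fails it, so it is "killed".
        System.out.println("original passes: " + testsPass(MaxMutationDemo::max));
        System.out.println("mutant passes:   " + testsPass(MaxMutationDemo::mutatedMax));
    }
}
```

Note that a suite containing only the (3, 3) check would let this mutant survive, because both versions return 3 for equal arguments.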

Java has a library called pitest, which enables running existing tests as mutation tests. Let’s enable it:

<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>LATEST</version>
  <configuration>
    <mutators>
      <mutator>DEFAULTS</mutator>
    </mutators>
  </configuration>
</plugin>

$ mvn test-compile org.pitest:pitest-maven:mutationCoverage

The terminal is filled with the following lines:

[... a lot of text ...]
> org.pitest.mutationtest.engine.gregor.mutators.RemoveConditionalMutator_EQUAL_ELSE
>> Generated 3 Killed 3 (100%)
> KILLED 3 SURVIVED 0 TIMED_OUT 0 NON_VIABLE 0
> MEMORY_ERROR 0 NOT_STARTED 0 STARTED 0 RUN_ERROR 0
> NO_COVERAGE 0
--------------------------------------------------------------------------------
> org.pitest.mutationtest.engine.gregor.mutators.RemoveConditionalMutator_ORDER_IF
>> Generated 3 Killed 3 (100%)
> KILLED 3 SURVIVED 0 TIMED_OUT 0 NON_VIABLE 0
> MEMORY_ERROR 0 NOT_STARTED 0 STARTED 0 RUN_ERROR 0
> NO_COVERAGE 0
--------------------------------------------------------------------------------
> org.pitest.mutationtest.engine.gregor.mutators.rv.CRCR3Mutator
>> Generated 7 Killed 6 (86%)
> KILLED 6 SURVIVED 1 TIMED_OUT 0 NON_VIABLE 0
> MEMORY_ERROR 0 NOT_STARTED 0 STARTED 0 RUN_ERROR 0
> NO_COVERAGE 0
--------------------------------------------------------------------------------
> org.pitest.mutationtest.engine.gregor.mutators.ConditionalsBoundaryMutator
>> Generated 3 Killed 2 (67%)
> KILLED 2 SURVIVED 1 TIMED_OUT 0 NON_VIABLE 0
> MEMORY_ERROR 0 NOT_STARTED 0 STARTED 0 RUN_ERROR 0
> NO_COVERAGE 0
[... a lot of text ...]

The lines with SURVIVED 1 mean that some code in the system under test was changed and the tests still passed. In other words, the tests weren’t actually testing everything, although the code coverage was 100%!
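To see how a mutant can slip through, here is a rough sketch that assumes the surviving ConditionalsBoundaryMutator mutant is the one on the US branch (the report excerpt above does not say which line it is). The class and method names are hypothetical; the two checks in originalTestsPass mirror the US assertions of the first test.

```java
import java.util.function.IntPredicate;

public class SurvivingMutantDemo {
    // US branch of canDrinkAlcohol as first written: true means "may drink".
    public static boolean originalUs(int age) {
        return !(age < 20);
    }

    // Assumed boundary mutant: '<' changed to '<='.
    public static boolean mutantUs(int age) {
        return !(age <= 20);
    }

    // The two US assertions from the first version of the test (ages 18 and 21).
    public static boolean originalTestsPass(IntPredicate canDrink) {
        return !canDrink.test(18) && canDrink.test(21);
    }

    public static void main(String[] args) {
        // Both versions pass, so the tests cannot tell them apart: the mutant survives.
        System.out.println("original passes: " + originalTestsPass(SurvivingMutantDemo::originalUs));
        System.out.println("mutant passes:   " + originalTestsPass(SurvivingMutantDemo::mutantUs));
        // Only an input at the boundary separates the two versions.
        System.out.println("differ at age 20: " + (originalUs(20) != mutantUs(20)));
    }
}
```

The two versions only disagree at age 20, which the first test never checks.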

The graphical view shows a bit more clearly what is going on:

$ xdg-open target/pit-reports/*/index.html

From the report, it’s clear that the code wasn’t actually well tested, so let’s change the tests to match reality:

@Test
public void TestAlcoholLegalAges()
{
    assertFalse(App.canDrinkAlcohol(16, App.CountryCode.FI));
    assertFalse(App.canDrinkAlcohol(17, App.CountryCode.FI));
    assertTrue(App.canDrinkAlcohol(18, App.CountryCode.FI));
    assertTrue(App.canDrinkAlcohol(19, App.CountryCode.FI));

    assertFalse(App.canDrinkAlcohol(19, App.CountryCode.US));
    assertFalse(App.canDrinkAlcohol(20, App.CountryCode.US));
    assertTrue(App.canDrinkAlcohol(21, App.CountryCode.US));
    assertTrue(App.canDrinkAlcohol(22, App.CountryCode.US));

    assertFalse(App.canDrinkAlcohol(14, App.CountryCode.DE));
    assertFalse(App.canDrinkAlcohol(15, App.CountryCode.DE));
    assertTrue(App.canDrinkAlcohol(16, App.CountryCode.DE));
    assertTrue(App.canDrinkAlcohol(17, App.CountryCode.DE));
}

…run the tests:

$ mvn test
[...]
[INFO] Running fi.bytecraft.mutations.AppTest
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.019 s <<< FAILURE! - in fi.bytecraft.mutations.AppTest
[ERROR] TestAlcoholLegalAges(fi.bytecraft.mutations.AppTest)  Time elapsed: 0.004 s  <<< FAILURE!
java.lang.AssertionError
        at fi.bytecraft.mutations.AppTest.TestAlcoholLegalAges(AppTest.java:19)
[...]

…after which we can see the bug in the original code:

-       } else if (country == CountryCode.US && age < 20) {
+       } else if (country == CountryCode.US && age <= 20) {

After that change, the tests pass. When re-running pitest, the report is green!

In this example, we managed to find a bug with mutation testing, which wasn’t visible with standard code coverage measurements.

Not a panacea

Mutation testing, like all other tools, has its downsides. The clearest ones are that running tests is a lot slower when mutating, and that the reports often contain false positives. Like other tests, mutation testing doesn’t guarantee that the code works.

Mutation testing is also limited to the mutations which the library has implemented. These include swapping operators for other operators and changing the values of constants. In theory, pitest could be extended with custom mutators, but it’s not trivial.

On test coverage

Some projects require that the test coverage stays above some arbitrary limit. This usually just results in tests being written for the sake of coverage, rather than focusing on testing what is actually critical for the system to work.

The same applies to mutation testing. Although the frameworks can export CI-friendly numbers, that doesn’t mean these should be used as-is. Before you set up your CI to fail the build when the mutation test coverage is under 100%, please do think about whether that is actually necessary for your project, or whether it would be more useful to focus on things like documentation instead.

False positives

Occasionally you run into code for which mutation testing reports false positives. A simple example is this clamp function:

public static int clamp(int x, int max) {
    if(max <= x) return max;
    return x;
}

Suppose you write a bunch of tests for it:

@Test
public void TestClamp()
{
    assertTrue(App.clamp(0, 2) == 0);
    assertTrue(App.clamp(1, 2) == 1);
    assertTrue(App.clamp(2, 2) == 2);
    assertTrue(App.clamp(3, 2) == 2);

    assertTrue(App.clamp(0, -1) == -1);
    assertTrue(App.clamp(1, -1) == -1);
    assertTrue(App.clamp(2, -1) == -1);
}

Although the tests exercise everything the code does, pitest isn’t happy and reports a surviving mutation.

This is caused by the nature of the code: when x == max, it doesn’t matter which path the code takes, as the return value is the same. Because of this, no test can detect changing the <= operator to <. In these situations, the only choices are to either disable mutation testing for the function altogether, or to disable the specific mutations which don’t change the behaviour.
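The claim can be checked by brute force. The sketch below compares clamp with a hand-written copy that uses < instead of <= over a small grid of inputs; the names EquivalentMutantDemo, clampMutant and countDifferences are invented for this illustration and are not part of pitest.

```java
public class EquivalentMutantDemo {
    // Original clamp from the text.
    public static int clamp(int x, int max) {
        if (max <= x) return max;
        return x;
    }

    // Hand-written mutant (illustration only): '<=' changed to '<'.
    public static int clampMutant(int x, int max) {
        if (max < x) return max;
        return x;
    }

    // Counts the inputs with -range <= x, max <= range where the two versions disagree.
    public static int countDifferences(int range) {
        int differences = 0;
        for (int x = -range; x <= range; x++) {
            for (int max = -range; max <= range; max++) {
                if (clamp(x, max) != clampMutant(x, max)) {
                    differences++;
                }
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Prints 0: no input in the grid separates the mutant from the original.
        System.out.println("differing inputs: " + countDifferences(50));
    }
}
```

Since the two versions return the same value for every input, no additional test case could kill this mutant.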

Summary

Mutation testing is one tool among others. If your language has an easy-to-use library for it, I recommend checking it out. However, don’t assume that getting green reports out of it means working code.


  1. Naturally, without telling the country or the age of the person. ↩︎

Copyright (c) 2025 Jaakko Hannikainen