Behavioral vs Implementation Testing

Implementation vs behavior focused testing #

Before we get stuck in, let's make sure we're all on the same page with a few definitions.

Implementation tests focus on how your code accomplishes a task.
Behavior tests focus on ensuring the task has been accomplished.

The difference is so subtle you might miss it if you sneeze. But it's there, and it matters.

The difference is that with behavior focused testing, we're testing the edges or boundaries of a system. What happens inside those boundaries is not our concern. There's a behavioral promise in place between our system and the users of that system. Our tests should ensure that promise is being fulfilled.

Still with me? Clear as mud? Don't worry this will start making sense soon. Ok moving on.

Some people tend to argue (or politely suggest) that all we're really talking about here is unit vs integration tests. Well, that's not really true and here's why:

def delete_all_items(items):
    items.clear()

from unittest.mock import MagicMock

def test_delete_all_items_implementation():
    items = MagicMock()
    delete_all_items(items)
    items.clear.assert_called_once()


def test_delete_all_items_behavior():
    items = [1, 2, 3, 4]
    delete_all_items(items)
    assert len(items) == 0

Clearly these are not integration tests, yet we can still define one as being implementation focused while the other is behavior focused.

The first test is brittle and wont survive refactoring. The moment we change how delete_all_items works, the test breaks regardless of whether or not the function still accomplishes the same task.

The behavior focused test on the other hand is focused on the end result. What is delete_all_items actually trying to accomplish? Well, in the end we want items to be empty and we don't really care how that happens.

Now we're getting to the heart of the matter, we want our code to perform a certain task and all we really care about is that the task is completed successfully, regardless of how it happens.

The problem with implementation tests #

One of the main problems with implementation focused tests is that they're brittle and seldom survive code refactoring.

So I already see a few hands going up to say: But that's a good thing, I want my tests to fail if any part of my code changes. How else will I know if something is broken?

To answer that question, let's jump into a slightly more complex example. Here we're coding away in our job at a AAA game studio, developing the space ship registry system for our latest game.

from dataclasses import dataclass
from typing import List, Optional
from enum import Enum

class ShipClass(Enum):
    FIGHTER = "fighter"
    CRUISER = "cruiser"
    BATTLESHIP = "battleship"

@dataclass
class SpaceShip:
    registry_id: str
    ship_class: ShipClass

class InMemoryShipRegistry:
    """
    Manages ship registration for the United Space Federation.
    Ships must be registered before they can dock at space stations
    or engage in legal trade routes.
    """
    def __init__(self):
        self._registered_ships: List[SpaceShip] = []
        
    def register_ship(self, ship: SpaceShip) -> None:
        """Register a new ship in the federation database."""
        self._registered_ships.append(ship)

    def get_ship(self, registry_id: str) -> Optional[SpaceShip]:
        """Lookup a ship by its unique registry identifier."""
        return next(
            (ship for ship in self._registered_ships if ship.registry_id == registry_id),
            None
        )

    def get_ships_by_class(self, ship_class: ShipClass) -> List[SpaceShip]:
        """Get all ships of a specific class."""
        return [
            ship for ship in self._registered_ships
            if ship.ship_class == ship_class
        ]

Ok so we've got our ship registry. Let's write a couple implementation focused tests:

def test_registered_ship_exists_in_registry():
    # Arrange
    ship = SpaceShip(registry_id="USS-FALCON-9", ship_class=ShipClass.FIGHTER)
    registry = InMemoryShipRegistry()
    
    # Act
    registry.register_ship(ship)
    
    # Assert
    assert ship in registry._registered_ships
    assert len(registry._registered_ships) == 1

def test_get_ships_by_class(self):
    # Arrange
    registry = InMemoryShipRegistry()
    fighter = SpaceShip("USS-FALCON-9", ShipClass.FIGHTER)
    cruiser = SpaceShip("USS-VICTORY", ShipClass.CRUISER)
    registry._registered_ships = [fighter, cruiser]

    # Act
    fighters = registry.get_ships_by_class(ShipClass.FIGHTER)
    
    # Assert
    assert len(fighters) == 1
    assert fighters[0] is fighter

Wow those tests kind of hurt the eyes, but let's go with it. As you can see, in both cases we're coupling our test code tightly to the implementation of the system under test.

If anything changes in the underlying mechanics of the InMemoryShipRegistry, it's likely our tests are going to break even if the functionality remains unchanged. But more on that later.

Now let's write a couple behavior focused tests:

def test_registered_ship_exists_in_registry(self):
    # Arrange
    ship = SpaceShip("USS-FALCON-9", ShipClass.FIGHTER)
    registry = InMemoryShipRegistry()
    
    # Act
    registry.register_ship(ship)
    found_ship = registry.get_ship(ship.registry_id)
    
    # Assert
    assert found_ship == ship

def test_get_ships_by_class(self):
    # Arrange
    registry = InMemoryShipRegistry()
    fighter1 = SpaceShip("USS-FALCON-9", ShipClass.FIGHTER)
    fighter2 = SpaceShip("USS-XWING-RED5", ShipClass.FIGHTER)
    cruiser = SpaceShip("USS-VICTORY", ShipClass.CRUISER)
    
    # Act
    registry.register_ship(fighter1)
    registry.register_ship(fighter2)
    registry.register_ship(cruiser)
    fighters = registry.get_ships_by_class(ShipClass.FIGHTER)
    
    # Assert
    assert len(fighters) == 2
    assert all(ship.ship_class == ShipClass.FIGHTER for ship in fighters)

Ok so now we're testing the boundaries of the system under test. We're testing the behavior.

First we make sure that a ship can be registered by using the API's which the InMemoryShipRegistry gives to us. We don't care how a ship is registered, we just want to know that it gets done.

Next we make sure that we can get all ships by class, again using the API's which the InMemoryShipRegistry provides.

Refactor and improve performance with confidence #

But maybe you're not sold yet. Maybe we need to go a step further and do some refactoring of our InMemoryShipRegistry to see how that affects our tests.

In our new registry, we're going to use a dictionary for registered ships. We're also going to introduce a dictionary to manage ships by class. Here we go:

class InMemoryShipRegistry:
    def __init__(self):
        self._ships_by_registry = {}
        self._ships_by_class = {
            ship_class: [] for ship_class in ShipClass
        }
        
    def register_ship(self, ship: SpaceShip) -> None:
        self._ships_by_registry[ship.registry_id] = ship
        self._ships_by_class[ship.ship_class].append(ship)
        
    def get_ship(self, registry_id: str) -> Optional[SpaceShip]:
        return self._ships_by_registry.get(registry_id, None)
        
    def get_ships_by_class(self, ship_class: ShipClass) -> List[SpaceShip]:
        return self._ships_by_class[ship_class].copy()

All good so far. We now have a shiny new underlying implementation of our InMemoryShipRegistry and have ensured that our public facing API's continue to work.

Here's where the rubber meets the road. Let's run our existing implementation focused tests:

Found 2 test(s).
FF
Ran 2 tests in 0.01s
FAILED (failures=2)

Darn. Looks like we're going to have to spend time fixing our broken tests because they both relied heavily on the internal implementation.

Next let's see what our existing behavior focused tests have to say:

Found 2 test(s).
..
Ran 2 tests in 0.01s
PASSED

Awesome, all green. Our expected behavior, the promise that exists between our system under test and the user, is still in tact. Let's move on with our day.

What are the benefits of behavior testing? #

Let's rapid fire some of the benefits of a more behavior focused approach to testing:

Maintainability #

Tests survive refactoring: Less baby sitting your tests
Focus on what matters to users or consumers
Reduced coupling between tests and implementation choices

In a nutshell, we spend less time baby sitting tests.

Our tests will survive through implementation changes and performance optimizations. We can go from having a space station's docking system which uses a simple list to a complex queuing system without breaking tests.

Better documentation #

Tests describe system behavior
Tests serve as examples of how to use the system
New team members are able to familiarize themselves with the capabilities of the system by reading the tests

Documenting an active, ever evolving code base isn't always easy. In spite of best efforts, doc strings, wiki's and the like aren't always 100% in sync with the code they're documenting.

Behavior focused tests on the other hand provide documentation automatically.

Meaningful test coverage #

Tests are written at the appropriate level of abstraction
Redundant tests are avoided
Quality over quantity

Here's looking at you 100% coverage gang. Let's focus on quality over quantity.

Improved development speed #

Less time spent maintaining tests
Faster feedback cycles
More confidence in making changes

I think these speak for themselves.

Should all tests be behavior focused tests? #

Based on everything I've presented so far, you might think I'm advocating for completely eliminating implementation focused tests. But in reality, there is room (and a need) for both styles.

A few scenarios that come to mind where implementation focused tests are appropriate could be:

Complex algorithms
Performance-critical code
Core game mechanics

Back to our space game example, we might have an algorithm which determines weapon damage. Our algorithm takes into account armor penetration physics, critical hit probabilities, damage falloff over distance etc. These calculations should likely have fine grained implementation focused tests.

Finding the balance #

I find that focusing completely on one or the other style of test in all situations ends up in pain, one way or the other.

My preference is to default to behavior focused testing the vast majority of time. When exceptions to that rule arise, it's clear that I'm dealing with a small, tightly constrained piece of code where the implementation really matters and should therefore be tested.

Should you be testing implementation details?