This is part of a series where I share my professional values and working style openly, so that future colleagues can get a sense of who I am before we ever work together.

The Testing Problem

One of the biggest challenges I have encountered in my career is testing. It sounds so simple: test it in a production-like environment before you release it. But what is production-like? Like production in which ways?

Usually that doesn't mean at the same scale as production. Generally it means an environment that has all the necessary components of a production environment, without the scale. Another challenge is that even once you have built your testing environment, many bugs, features, or behaviors are not apparent without particular data sets: data configured, generated, or otherwise introduced by customers actually using the system.

Imagine a QA engineer tasked with reviewing a feature of a distributed system and verifying that it is correct before it goes out to production as a bug fix, hotfix, or part of a release. Distributed systems can fail in many, often unpredictable, ways. Determining the root cause can be quite time consuming, especially if a particular state of the system first has to be reproduced.

The QA Bottleneck

I have seen a QA team become bogged down in the maintenance of its testing environments. It is a difficult situation for an organization to design around. QA engineers are usually not hired for the same skill set as the developers, or as the SREs who run the application in production. And yet their work requires them to quickly, accurately, and confidently ascertain the state of a complex system. If something isn't working, they need to attribute the problem correctly: to the hosting platform, configuration values, the data layer, the network, or a bug in the code. It's not a simple task.

I participated directly in the work that retooled an automated test-environment system, taking its reliance on infrastructure as code through Terraform a bit further. For context, this was roughly 2019-2022, and for many reasons the workflow was not containerized.

Here are a few of the changes put in place during the redesign:

  • Keep all Terraform code and state. A prior implementation assumed the IaC code could be treated as fire-and-forget. In practice, that was rarely the case.
  • Implement provisioning through a Jenkins pipeline, allowing us to rely on Jenkins' workflow management tooling instead of writing our own.
  • Give each component its own folder under a single directory, making every component of the test environment easy to find in source control.
  • Use simple, lightweight Bash scripts to handle templating of Terraform (this was before Terragrunt was widely adopted) and to supply variables that couldn't be templated.
  • Use built-in retry logic to overcome transient network issues that could affect provisioning of any one of the 20+ components in the system.
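The retry approach above can be sketched as a small Bash wrapper. This is a minimal illustration, not the original script: the function name, argument order, and the commented-out terraform invocation are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch of a generic retry wrapper for provisioning steps that can fail
# transiently. Names and the example command below are illustrative only.
set -euo pipefail

# retry <max_attempts> <delay_seconds> <command...>
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example (hypothetical layout): retry one component's apply up to 3 times.
# retry 3 30 terraform -chdir=components/dns apply -auto-approve
```

Keeping the wrapper generic means every one of the 20+ components can share the same transient-failure handling instead of each script reinventing it.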

Then the Users Showed Up

Once we overcame the issue of establishing consistently built test environments, new issues arose:

  • Users running too many test environments for the organization's cost tolerance
  • Users creating environments and leaving them running while idle
  • Users re-running environment creation with the exact same name and arguments
  • Users launching environments in the "wrong" region
  • Users not being able to track down who launched an environment or why, and whether or not it was safe to remove it

At first read, perhaps it looks like I am blaming the users, but that's not my goal. My goal is to point out that users act in predictably chaotic ways because they have their own concerns and motivations. If the system allows them to do something, they will assume it was an acceptable thing to do.

A system built without the proper guardrails to guide users within the bounds of expected behavior will quickly devolve into chaos. We all like to think that it won't happen, but lived experience as a technologist quickly teaches you otherwise.
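To make "guardrails" concrete, here is a hedged Bash sketch of the kind of pre-flight check a launch pipeline might run. The region allow-list and the required owner field are invented for illustration; they are not taken from the system described above.

```shell
#!/usr/bin/env bash
# Sketch of a pre-flight guardrail: reject launches in unexpected regions
# and require a recorded owner so environments can be traced back to a
# person. Values here are illustrative assumptions.
set -euo pipefail

ALLOWED_REGIONS="us-east-1 eu-west-1"  # hypothetical allow-list

# validate_request <region> <owner>
validate_request() {
  local region="$1" owner="$2"
  case " ${ALLOWED_REGIONS} " in
    *" ${region} "*) ;;  # region is on the allow-list
    *)
      echo "Region '${region}' is not on the allow-list" >&2
      return 1
      ;;
  esac
  if [ -z "$owner" ]; then
    echo "Every environment needs a recorded owner" >&2
    return 1
  fi
}

# Example: validate_request us-east-1 alice && launch_environment ...
```

A check like this fails fast, before any money is spent, and turns "the wrong region" and "nobody knows who launched this" from cleanup problems into impossible states.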

Iterating on the Pain Points

Once the pain point of simply creating, running, and validating test environments was solved, we moved quickly to improving the reliability of the environments and to addressing the user pain points that emerged.

Adding a "created by" field helped in many cases, but also introduced another user behavior anti-pattern: asking a colleague to launch the environment so that it has their name on it.

Adding a user-controllable name for the environment allowed people to follow the discipline of labeling environments with a ticket ID, but also enabled them to use an ID that matched the pattern without referring to a ticket that actually existed.
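The ticket-ID guardrail can be sketched in a few lines of Bash. The regex only checks shape, which is exactly the loophole described above; confirming that the ticket really exists would need a call to the tracker's API, shown here only as a hypothetical, commented-out helper.

```shell
#!/usr/bin/env bash
# Sketch: shape-checking a ticket ID. The assumed format (uppercase
# project key, dash, digits, e.g. OPS-1234) is an illustration, not the
# actual convention from the system described in the text.
set -euo pipefail

looks_like_ticket() {
  [[ "$1" =~ ^[A-Z]{2,10}-[0-9]+$ ]]
}

# ticket_exists() {
#   # Hypothetical existence check; endpoint and auth are placeholders.
#   curl -fsS "https://tracker.example.com/api/issues/$1" >/dev/null
# }

# Example: looks_like_ticket "OPS-1234" passes, but so would a made-up
# "OPS-99999" -- closing that gap requires the existence check above.
```

The gap between "matches the pattern" and "refers to a real ticket" is where the next anti-pattern lived, which is the point of the paragraph above.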

Each guardrail solved one problem and revealed the next. That's not a failure of the design process. That's the design process.

Trust Over Controls

Staying in touch with your users and building trust and interest in good working practice is the most effective way to move quickly. High-trust environments are conducive to speed. You can't guardrail your way to a well-functioning team. At some point, the system has to trust the people using it, and the people have to trust that the system was built with their needs in mind.

Conversely, if you have users who seem determined to find another way to bend the rules of the system, maybe they should join your QA team.