This is part of a series where I share my professional values and working style openly, so that future colleagues can get a sense of who I am before we ever work together.

Failure Modes Evolve with the System

Failure modes and blast radius issues are often unexpected, and it is common to discover new ones as a system evolves. Systems can change in a variety of ways:

  • Increased utilization by customers
  • Growth of the data layer
  • Changes in components
  • Changes in the platform
  • New features
  • New usage patterns
  • Automated testing patterns

It's unlikely that any team will be able to anticipate all the failure modes during system design, but it is possible to learn the patterns, and the design choices that make those scenarios easier to detect and handle.

I'd like to describe one failure mode I witnessed and grappled with, to illustrate the perspective I bring and the way that I reason about systems. This is an example of a noisy neighbor problem, over-utilization, and blast radius.

I'm going to leave out identifiable details from the story, but describe things in enough detail to be relevant on a principles basis.

The System

This system was composed in part of a series of load balancers routing traffic to pools of worker nodes. The worker nodes would route requests to customer backend systems, wait for their reply, and then send that response back to the traffic originator.

As each node handled many concurrent requests, the primary resource contention on the node was the network. Imagine this:

  • A customer is pushing 10,000 QPS through a load balancer.
  • Those requests are distributed to the target group.
  • The number of concurrent requests in flight on a given node is the request rate multiplied by the total time each request is held open: the latency of handling the request on the worker plus the time the request spends at the customer backend system.
  • If the customer backend system is highly performant, this drives the total number of concurrent requests downward.
  • When a customer backend system slows down for any reason, the number of concurrent requests in flight increases if their traffic level stays the same.
  • Because each worker can maintain only a limited number of network connections, bounded by operating system limits, network performance, and available memory, a customer backend slowdown can trigger an availability issue when horizontal scaling cannot keep up with demand.
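The relationship in the bullets above is Little's Law: the average number of requests in flight equals the arrival rate multiplied by the average time each request is held open. A small sketch with hypothetical figures (the QPS number comes from the story; the latencies are assumptions for illustration):

```python
# Little's Law: concurrency = arrival_rate * avg_time_in_system.
# Latency figures below are hypothetical, chosen only to illustrate
# how a backend slowdown multiplies the in-flight request count.

def in_flight(qps: float, worker_latency_s: float, backend_latency_s: float) -> float:
    """Average number of concurrent requests held open across the pool."""
    return qps * (worker_latency_s + backend_latency_s)

# Healthy backend: 10,000 QPS, 5 ms on the worker, 50 ms at the backend.
healthy = in_flight(10_000, 0.005, 0.050)   # ~550 concurrent requests

# Same traffic, but the backend slows to 2 s per request.
degraded = in_flight(10_000, 0.005, 2.0)    # ~20,050 concurrent requests

print(healthy, degraded)
```

The traffic never changed; only the backend latency did, yet the socket footprint on the worker pool grew roughly 36x.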

How the Failure Unfolded

Here's how it would play out:

  1. A customer is running a high QPS rate through the system.
  2. The customer backend system slows down.
  3. The workers now have more requests in flight, and this shows up in network utilization metrics as climbing TCP TIME_WAIT sockets.
  4. Once the workers hit their limit of total sockets in use, they start to serve HTTP 5XX responses to both the client system and the load balancer.
  5. The lack of a robust health check on the worker allows it to continue receiving traffic from the load balancer, even while it is unable to service it.
  6. Those 5XX responses then cause the DNS load balancing health checks to fail.
  7. When those fail, the traffic fails over to the failover site, or in some cases a different geo-routed stack.
  8. This can introduce the thundering herd problem, where suddenly a failover stack is receiving 10x traffic over the course of several minutes.
  9. If this issue is happening on a target group shared by other customers, those customers' traffic also fails over.
  10. At this point, we discover how robust the failover plans actually are, and whether the customers had prepared for the shift in traffic arriving from a particular stack.
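The first few steps of that sequence can be sketched numerically. The worker counts and socket ceilings here are hypothetical; the point is how a backend slowdown converts a comfortable socket budget into exhaustion and then failover:

```python
# Hypothetical sketch of steps 1-4: backend slowdown -> socket
# exhaustion -> 5XX responses -> failover. All capacity figures
# are assumptions for illustration, not numbers from the real system.

WORKERS = 20
SOCKETS_PER_WORKER = 2_000              # assumed OS/memory ceiling per node
POOL_CAPACITY = WORKERS * SOCKETS_PER_WORKER

def pool_in_flight(qps: float, total_latency_s: float) -> float:
    return qps * total_latency_s        # Little's Law again

qps = 10_000
worker_latency = 0.005
for backend_latency in (0.05, 0.5, 2.0, 5.0):
    load = pool_in_flight(qps, worker_latency + backend_latency)
    status = "healthy" if load < POOL_CAPACITY else "serving 5XX -> failover"
    print(f"backend {backend_latency}s: {load:.0f} in flight ({status})")
```

Nothing about the traffic level changed between the first line and the last; the backend latency alone pushed the pool past its socket budget.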

The Root Cause and the Advocacy

This was a huge pain point for the application, customers, and the engineering and SRE teams. As an SRE, I was on the incident response for these events, writing the RCA documents, leading the post-mortems, and discussing the issue with engineering stakeholders.

The reason this failure mode occurred traced back to a decision made years earlier, when load balancer health checks were not seen as a valuable feature to implement robustly. That lack of an accurate health check caused enormous customer pain over the years, and also blocked other resiliency improvements, such as auto-scaling.

The health check issue was one I noticed immediately when I joined the team, and I advocated for it as each incident rolled by. The resistance was less about correctness and more about team dynamics: there was a perception that the health check was a pet project of the SRE team, competing for resources and time against what the engineering team could take on. We continually advocated for its implementation, and I reached out to a highly capable and insightful engineer with the context and production figures he needed to make the case within his own team that the project was worth engineering time.

In the end, a high-quality health check was implemented and put to use, enabling auto-scaling on the platform and finally introducing the elasticity for horizontal scaling of workers that was so desperately needed.

Sometimes it can be difficult for today's engineers to remember why certain features were added to our distributed systems tooling. I have the benefit of having seen those lessons play out first hand, and of advocating for prioritizing these features even when a team has gotten by without them for years. In today's world of Kubernetes and containerized orchestration platforms, the complete absence of a health check may be hard to imagine. A health check that isn't well implemented, one that doesn't truly detect the node's ability to service a request, is much easier to imagine, since the implementation details affect the outcomes so dramatically.
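To make the distinction concrete, here is a minimal sketch (hypothetical, not the actual implementation from the story) contrasting a health check that always reports healthy with one tied to the node's real socket headroom:

```python
# Hypothetical sketch: a naive health check vs. one that reflects the
# node's actual remaining capacity to service a request. The socket
# ceiling and 90% headroom threshold are illustrative assumptions.

MAX_SOCKETS = 2_000          # assumed per-node connection ceiling

def naive_health(in_flight: int) -> int:
    # Always 200: the load balancer keeps routing traffic even after
    # the node has exhausted its sockets and is serving 5XX.
    return 200

def robust_health(in_flight: int, limit: int = MAX_SOCKETS) -> int:
    # Fail the check before the ceiling is hit, so the load balancer
    # drains traffic away while the node can still answer at all.
    return 200 if in_flight < limit * 0.9 else 503

print(naive_health(2_000), robust_health(2_000))   # 200 503
```

The naive version is what lets step 5 of the cascade happen: a node drowning in sockets still looks healthy to its load balancer. The robust version sheds load early enough for the pool to recover without tripping the site-level DNS checks.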