Why do we retrospect on our incidents? Why spend the time doing those write-ups and holding review meetings? We don’t do this work as some sort of intellectual exercise for amusement. Rather, we believe that if we spend the time to understand how the incident happened, we can use that insight to improve the system in general, and availability in particular. We improve availability by preventing incidents as well as reducing the impact of incidents that we are unable to prevent. This post-incident work should help us do both.
The typical approach to post-incident work is to do a root cause analysis (RCA). The idea of an RCA is to go beyond the surface-level symptoms to identify and address the underlying problems revealed by the incident. After all, it’s only by getting at the root of the problem that we will be able to permanently address it. When we attach the label root cause to something in an RCA, we’re making a specific claim. That claim is: we should focus our attention on the issues that we’ve labeled “root cause”, because spending our time addressing these root causes will yield the largest improvements to future availability. Sure, it may be that there were a number of different factors involved in the incident, but we should focus on the root cause (or, sometimes, a small number of root causes), because those are the ones that really matter. Sure, the fact that Joe happened to be on PTO that day, and he’s normally the one that spots these sorts of problems early, that’s interesting, but it isn’t the real root cause.
Remember that an RCA, like all post-incident work, is supposed to be about improving future outcomes. As a consequence, a claim about root cause is really a prediction about future incidents. It says that of all of the contributing factors to an incident, we are able to predict which factor is most likely to lead to an incident in the future. That’s quite a claim to make!
Here’s the thing, though. As our history of incidents teaches us over and over again, we aren’t able to predict how future incidents will happen. Sure, we can always tell a compelling story of why an incident happened, through the benefit of hindsight. But that somehow never translates into predictive power: we’re never able to tell a story about the next incident the way we can about the last one. After all, if we were as good at prediction as we are at hindsight, we wouldn’t have had that incident in the first place!
A good incident retrospective can reveal a surprisingly large number of different factors that contributed to the incident, providing signals for many different kinds of risks. So here’s my claim: there’s no way to know which of those factors is going to bite you next. You simply don’t possess a priori knowledge about which factors you should pay more attention to at the time of the incident retrospective, no matter what the vibes tell you. Zeroing in on a small number of factors will blind you to the role that the other factors might play in future incidents. Today’s “X wasn’t the root cause of incident A” could easily be tomorrow’s “X was the root cause of incident B”. Since you can’t predict which factors will play the most significant roles in future incidents, it’s best to cast as wide a net as possible. The more you identify, the more context you’ll have about the possible risks. Heck, maybe something that only played a minor role in this incident will be the trigger in the next one! There’s no way to know.
Even if you’re convinced that you can identify the real root cause of the last incident, it doesn’t actually matter. The last incident already happened; there’s no way to prevent it. What’s important is not the last incident, but the next one: we’re looking at the past only as a guide to help us improve in the future. And while I think incidents are inherently unpredictable, here’s a prediction I’m comfortable making: your next incident is going to be a surprise, just like your last one was, and the one before that. Don’t fool yourself into thinking otherwise.