Before we get started, could you please raise your hand if you’ve read and are familiar with a paper entitled “How Complex Systems Fail”? If you’re not familiar with this, you can Google it.
You’ll find it easily. I think that your neighbors in the audience who did raise their hand would tell you that it was worth it.
First, I want to disclose some information and set expectations. I’ve never given this talk before and I did finish it 12 minutes ago.
I’m not sure how this is going to go.
Have you ever felt unsure how some things were going to go? I’m at SREcon just thinking, have you ever been responding to an incident where you and your coworkers are like, “Oh, it’s probably this, it’s probably this thing.” We should probably do X. We should probably run this command – it should fix everything. And you’re just about to hit enter on that command and you thought, “I’m not sure how this is gonna go.”
That’s what’s happening right now with me.
This is the quote that I riffed on when coming up with the title for the talk.
This talk is about thinking critically about common beliefs that we have in this community – questioning conventional wisdom because some things may end up being misconceptions.
“It ain’t what you don’t know that gets you in trouble. It’s what you know for sure that just ain’t so.” This was written by Mark Twain… or maybe someone else, but somebody definitely wrote it.
Let me give you an example of what I mean. It's not an SREcon-style example, but something more general. Think about a common misconception: that the Great Wall of China is visible from space. You've probably heard people claim it is.
It turns out that’s not true.
Here’s another one: a penny dropped from the Empire State Building will not kill anybody if it hits them in the head.
These are examples of things we say, but there's no real cost to debunking them. If they were true, they'd be cocktail party trivia. And if they're not true, you still get to say, "Did you know? Fun fact: not true." These aren't things that matter.
A couple of years ago, my colleague David Woods said this to me:
“We cannot call it a scientific field unless we can admit we’ve gotten things wrong in the past.”
Here's Dave, giving testimony before the Senate science committee in 2003, during hearings on Capitol Hill about the future of NASA in the wake of the space shuttle Columbia accident.
I get to work with Dave!
So as a community: do we admit that we’ve gotten some things wrong in the past? Do we have a record of being productively skeptical of our assumptions or our implicit beliefs about our work?
I’ve got some ideas on that…but first: are we truly a scientific field?
Do we make productive use of empirical research to inform how we do our work?
I think we absolutely are a field that warrants real attention from real scientists to study our world. It's not 2015 anymore. There are a lot more practitioners who've become practitioner-researchers.
If you see me put your research up on the screen, I want you to raise your hand and say, "That's me" out loud.
Many of you also saw Courtney Nash and Laura Maguire's talk yesterday where they presented their research on trade-offs being made in time-pressured and consequential scenarios. My point is not just that we have a responsibility to our field to think critically about what we take as "given" and accurate, but that there is a growing and energetic number of us demonstrating expertise as scholars, doing this well enough to play in the "big leagues" of academia.
Boorstin does a much better job than Mark Twain (or whoever it was) with this quote: “The greatest obstacle to discovery is not ignorance. It’s the illusion of knowledge.”
Here’s my plan. I wanna do a quick leisurely tour through a couple of very specific misconceptions. Some of them are more like topics or broad groups of misconceptions.
Do we ever revisit our conventional wisdom with healthy doses of critical thinking? I think we do, actually. We heard a lot of that during this conference. This is what's great about communities of practice: we tend not to blindly believe anything that contrasts with what our real experience with the messy details shows us…every day.
For example…if I broadened what I mean by "our field" for a minute to include all programming work, I can find actual demonstrations where we have confronted ideas once thought to be "gospel"…
This example is on the topic of engineer productivity. Until the early nineties, the accepted way to track and understand software engineers’ productivity was amazingly simple: you counted the lines of code they wrote.
Then someone, after what I can only assume was about 30 seconds of thinking about it, said and wrote: “Hey, I think this is bullshit.”
In 1995, Capers Jones said, “the use of lines of code metrics for productivity and quality studies is to be regarded as professional malpractice.” This was a mic drop in the nineties. We’ve got a real world example here.
For many years, "lines of code" was a very big deal. It came from programming Fortran and COBOL on punch cards. It was never actually a good measure of anything, but people were tabulating it, thinking it was.
Here’s something else you’ll come across from time to time: The idea that “change” is either the only cause of incidents, or the leading cause, or some variation of that.
Has anybody ever heard of this?
Now, I want to be open here. I don't have enough money to buy the report from Gartner. But in 2015, Gartner said that 85% of performance incidents can be traced to changes.
I want to propose another take based on my experience. Let me know if I’m off base here:
Changes are also one of the leading causes of resolving incidents, yeah?
And is it fair to say that all prevented incidents are triggered by making changes?
So, sorry, I just want to make sure I understand this… Changes are bad…but also…changes are good?
Wait, what were we talking about?
Sometimes this fuzzy hard-to-pin-down conventional wisdom is even implicit in the way we talk about it. We talk about code freezes, but underlying that is an assumption. Sometimes we can revisit an assumption and it turns out to be true, which is great. That’s how science works – it’s called validation. Sometimes you revisit and question whether something is even real. Are you sure that’s a thing?
We need productive skepticism. In 2018, I wrote a blog post calling into question how we’re gathering all these numbers around incidents that don’t have anything to do with the content of the incident, but certainly have to do with the dimensions of the incident – like the frequency of incidents over time, the length of different portions of these incidents.
This blog post was really satisfying to write.
Since then, Courtney Nash has destroyed this topic.
If you haven’t seen or read any of her work on this, then you’re absolutely missing out. Courtney has expanded on this and has gone in incredibly interesting and discussion-provoking directions. You should put it on your to-do list.
I often see framings like this, where incidents follow a sequence of particular steps. Sometimes they're called steps, sometimes they're called phases…
…sometimes the whole thing is called a cycle, sometimes it's called a "lifecycle."
It's quite common to think of them this way, and it might be productive in certain circumstances. But there are variations of this. This one, which is different in a visual way, clearly took a lot of time to put together.
But when we look at our experience in this community with real incidents, it’s weird because they never feel as neat and orderly and crisp as they do on the slide. As I was making the slide, I was thinking this particular representation certainly looks like Excel. Like you’re only one formula away from getting your work done.
Is this how most incidents play out? Can we assume that enough incidents play out this way that we can ignore or dismiss those that don’t?
(Someone in the front row yelled, “What do you mean by an incident?” Yes, friend….yes.)
Yeah, we can make some assumptions, like “there is no friction” or “the cat is a cube,” and things can work, but we’re contorting reality.
And you may say to yourself (and you’d be right): “We all know this. We get it. It’s just a model. We are very smart. Everybody understands this.”
I know you all know this…but do they?
Because I do know this: I have experience, and some of you might have experience, with organizations that reward, punish, and make significant decisions based on numbers which critically depend on this model being concretely accurate.
So it’s not just “it doesn’t go like that.” It’s “hey, it doesn’t go like that because three steps away somebody is getting a report and they’re gonna make decisions based on that.”
I’m going to walk you through a real incident here — and ironically, I am oversimplifying. So, it’s a regular afternoon and somebody notices, “Huh, that’s weird.” Not shocking, but more like, “Steve, you going to lunch? I’m just gonna hang back and check this thing out.”
As time passes, things get a little weirder. You’re like, “Hold on a second. This is not good. This is definitely a thing.” You don’t know if it’s an incident or not, but it’s definitely a thing. So you spend some time figuring out what to do about it. You try to fix it. And then you spend more time to figure out whether or not that worked.
Here’s my question: where is “diagnose”?
This is an example, but I have no doubt that almost everybody in this room can tell me a story about an incident that looked like this. Cases where it wasn’t like, “You are now leaving the Diagnose Phase. What you have entered is Mitigation.”
I love Honeycomb for this write-up. If you detect something after it’s resolved, is the time-to-resolve negative? [In a whisper: Let’s not include it. They’ll never know.]
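To make that arithmetic concrete, here's a minimal sketch with invented timestamps and field names (an assumption for illustration, not Honeycomb's actual data), showing what falls out when detection arrives after resolution:

```python
from datetime import datetime

# Invented incident record: the issue resolved itself before anyone noticed.
started_at = datetime(2024, 3, 1, 14, 0)    # the anomaly begins
resolved_at = datetime(2024, 3, 1, 14, 20)  # it goes away on its own
detected_at = datetime(2024, 3, 1, 15, 5)   # someone finally notices, after the fact

time_to_detect = detected_at - started_at    # 1:05:00
time_to_resolve = resolved_at - detected_at  # negative: Python prints "-1 day, 23:15:00"

print(time_to_detect)
print(time_to_resolve)
print(time_to_resolve.total_seconds() < 0)   # True -- the phase model has no slot for this
```

Any reporting pipeline that averages these durations will quietly fold that negative number in, which is exactly the kind of distortion the linear model invites.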
One of the authors of ITIL is having his mind completely blown right now.
It's not just that we write the model down and the model is wrong. Models being wrong is just reality. It's that those models can kick off distant decisions that genuinely have consequences.
If we look at real world, concrete cases of anomaly response that have been validated across different domains of work from a cognitive work perspective, we see a different pattern.
Here’s a representation of something that my colleague Dr. Woods and others developed from cases in nuclear power control rooms. Note: it’s not a linear sequence. The types of cognitive work are intertwined and interdependent.
It's not a sequence. It's not a cycle. It's a dynamic phenomenon. Unfortunately, it does not lend itself well to populating a spreadsheet. Still, it is the reality that you probably all recognize. Even if you've never seen this diagram before, you could probably work out how it works intuitively.
Let's talk about repeat incidents. I'd like to plug Em Ruppe's talk on repeat incidents at SREcon EMEA last year, I think it was. Are there people whose organization explicitly talks about repeat incidents being significant or different or treated differently? Anybody?
We've talked with organizations where repeat incidents are very specifically counted and various reports are tabulated. I don't work there and don't exactly know what the impact of those reports is, formally or informally.
Look, we often discuss “repeat incidents,” and if it enables a discussion about a set of incidents then that’s great. It’s fine colloquially.
However, I want to make an assertion. The criteria for labeling an incident as a repeat matter more than the fact that there was a repeat. It's almost an invitation. When someone says "Oh yeah, this is a repeat" or "this happened again," the cognitive systems engineer in me just hears, "Ask me for more detail!" Was it literally at the exact same point in time, like, same Unix epoch? No. Was it the same people who responded? No.
Here. I picked this case out of the VOID. Are you familiar with the VOID, which Courtney has carefully curated and expanded into an empire?
It's a Second Life incident. I've never played or used Second Life, but I'm aware of it. This is a great case. "Incident started Thursday, blah blah blah, and it went on for a few hours," and then this is what stuck out to me: "but it magically went away on its own."
Next sentence: “The same thing happened again, but once again, it went away on its own.” Then it happened the next day.
My question is: are these three incidents? One?
They’re in the same write-up, so there’s an implicit relationship between them. But are they actually three? Because that could screw your average. Or it could help, maybe you need it!
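To see why the counting choice matters, here's a toy example with invented durations (an assumption for illustration, not the actual Second Life numbers): the same overnight anomaly, tallied two ways:

```python
# Invented numbers: three occurrences of the same anomaly, each resolving on its own.
durations_hours = [2.0, 1.5, 2.5]

# Tallied as three repeat incidents: three data points, a short average.
avg_as_three = sum(durations_hours) / len(durations_hours)  # 2.0 hours

# Tallied as one incident spanning Thursday through the next day:
# a single data point covering the whole window, say 30 hours end to end.
avg_as_one = 30.0

print(f"average duration as three incidents: {avg_as_three:.1f}h")
print(f"average duration as one incident:    {avg_as_one:.1f}h")
```

Same events, same write-up, and the average swings by an order of magnitude based on a labeling choice nobody wrote down.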
Okay. Incident response.
Again, these are just assertions. I’m less invested in some of these than I’m presenting. The talk isn’t about any one of them. I’m not here to convince you that my reflections are valid. This is about the exercise of discussing and questioning.
I’m going to make an assertion here. An organization can be the most skilled and efficient at keeping stakeholders up to date about ongoing incidents and still be terrible about learning from them or responding to them.
They could be the best in the industry. Keeping stakeholders up to date is an important part of what surrounds incidents. But it’s different than incident handling.
You could say that “incident response” is a broader umbrella. Maybe you’d say “incident management.” Maybe you have some other term for it. But I’m talking about the activities that hands-on practitioners have skills and expertise to engage in. Think back to that anomaly response model that Woods and others developed.
Yes, there are also people who are responsible for resourcing and providing support for the people who are handling the incident. Leadership, business, relationships, customers, other parts of the organization. It’s not that they’re not important, but they don’t have the expertise to do the handling work.
You can get really good at a portion of incident response. It's pretty multifaceted. I'm not saying you can't get good at each of them, but I am saying that they're different.
Here’s my proposal.
The capital-I Ideal when it comes to practitioners handling an incident: the people who respond to it are the exact people who can immediately recognize what is happening and the exact people who know what to do about it. And anything else that can support those two bits is paramount.
It’s not that other things aren’t important. But they’re secondary. They may be necessary, but they’re secondary.
Here’s a thought exercise. If you had to choose between having six significant digits of precision on customer impact in real time, or having the incident fluidly handled — it’s handled quicker with less disruption and drag — which one do you think the company would choose?
The wild part about this is that when you achieve the ideal, here’s what happens. People show up and say, “Oh look, it’s blah. Okay. You do that. I’ll do this. Are we good here? We’re good? Okay, let’s go to lunch.” When that happens, it’s not even labeled as an incident because it was handled so incredibly fluidly. It wasn’t even difficult for them.
This is all about expertise. It’s worthwhile and productive to invest in anything that supports, expands, augments, amplifies, broadens, diversifies expertise throughout the population of hands-on practitioners who respond to incidents.
And that can take all kinds of different shapes, many of which you already have. It shows up in code review. Everybody who works at your company knows it: if I ask a bunch of people at your company to name five people who give amazing code reviews, you're going to get a pretty strong overlap.
They’re known. Might not be written on the wiki, but everybody knows who they are. That’s what expertise is.
Anytime you have a situation where tenured veterans who've seen some shit can sit around and tell stories, especially in discussion with new hires, let's just say your competitors are hoping you won't do that.
And look at that. I’m done with the talk.
Thanks for listening, everyone!