
Revisiting What We Think We Already Know


Before we get started, could you please raise your hand if you’ve read and are familiar with a paper entitled “How Complex Systems Fail”? If you’re not familiar with this, you can Google it.

You’ll find it easily. I think that your neighbors in the audience who did raise their hand would tell you that it was worth it.

First, I want to disclose some information and set expectations. I’ve never given this talk before and I did finish it 12 minutes ago.

I’m not sure how this is going to go.

Have you ever felt unsure how some things were going to go? I’m at SREcon just thinking, have you ever been responding to an incident where you and your coworkers are like, “Oh, it’s probably this, it’s probably this thing.” We should probably do X. We should probably run this command – it should fix everything. And you’re just about to hit enter on that command and you thought, “I’m not sure how this is gonna go.”

That’s what’s happening right now with me.

This is the quote that I riffed on when coming up with the title for the talk.

This talk is about thinking critically about common beliefs that we have in this community – questioning conventional wisdom because some things may end up being misconceptions.

“It ain’t what you don’t know that gets you in trouble. It’s what you know for sure that just ain’t so.” This was written by Mark Twain… or maybe someone else, but somebody definitely wrote it.

Let me give you an example of what I mean. It’s not an SREcon-style example, but something more general. Think of a common misconception: you’ve probably heard people claim that the Great Wall of China is visible from space.

It turns out that’s not true.

Here’s another one: a penny dropped from the Empire State Building will not kill anybody if it hits them in the head.

These are examples of things we say, but there’s no real cost to debunking them. Repeating them as true is just cocktail party trivia. And if they’re not true, you still get to say, “Did you know? Fun fact: not true.” These aren’t things that matter.

A couple of years ago, my colleague David Woods said this to me:
“We cannot call it a scientific field unless we can admit we’ve gotten things wrong in the past.”

Here’s Dave, giving testimony before the Senate science committee during 2003 hearings on Capitol Hill about the future of NASA in the wake of the Space Shuttle Columbia accident.

I get to work with Dave!

So as a community: do we admit that we’ve gotten some things wrong in the past? Do we have a record of being productively skeptical of our assumptions or our implicit beliefs about our work?

I’ve got some ideas on that…but first: are we truly a scientific field? 

Do we make productive use of empirical research to inform how we do our work?

I think we absolutely are a field that warrants real attention from real scientists to study our world. It’s not 2015 anymore. There are a lot more practitioners who’ve become practitioner-researchers.

If you see me showing your research up on the screen, I want you to raise your hand and say, “That’s me” out loud.

Many of you also saw Courtney Nash and Laura Maguire’s talk yesterday, where they presented their research on trade-offs made in time-pressured, consequential scenarios. My point is not just that we have a responsibility to our field to think critically about what we take as “given” and accurate, but that a growing and energetic number of us are demonstrating expertise as scholars, doing this well enough to play in the “big leagues” of academia.


Boorstin does a much better job than Mark Twain (or whoever it was) with this quote: “The greatest obstacle to discovery is not ignorance. It’s the illusion of knowledge.”

Here’s my plan. I wanna do a quick leisurely tour through a couple of very specific misconceptions. Some of them are more like topics or broad groups of misconceptions.

Do we ever revisit our conventional wisdom with healthy doses of critical thinking? I think we do, actually. We heard a lot of that during this conference. This is what’s great about communities of practice: we tend not to blindly believe anything that contrasts with what our real experience with the messy details shows us…every day.

For example…if I broadened what I mean by “our field” for a minute to include all programming work, I can find actual demonstrations where we have confronted ideas once thought to be “gospel”…

This example is on the topic of engineer productivity. Until the early nineties, the accepted way to track and understand software engineers’ productivity was amazingly simple: you counted the lines of code they wrote. 

Then someone, after what I can only assume was about 30 seconds of thinking about it, said and wrote: “Hey, I think this is bullshit.”

In 1995, Capers Jones said, “the use of lines of code metrics for productivity and quality studies is to be regarded as professional malpractice.” This was a mic drop in the nineties. We’ve got a real world example here.

For many years, “lines of code” was a very big deal. It came from programming Fortran and COBOL on punch cards. It was never actually a good measure of anything, but people were tabulating it, thinking it measured something.
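To make that concrete, here’s a quick illustration of my own (not from the talk): two functions with identical behavior, where a lines-of-code tally would score one of them as several times more “productive” than the other.

```python
# Both functions return the sum of the even numbers in a list.
# A lines-of-code metric scores the verbose version as several
# times the "output" of the one-liner, for identical behavior.

def sum_of_evens_verbose(numbers):
    total = 0
    for n in numbers:
        is_even = (n % 2 == 0)
        if is_even:
            total = total + n
    return total

def sum_of_evens(numbers):
    return sum(n for n in numbers if n % 2 == 0)

assert sum_of_evens_verbose([1, 2, 3, 4]) == sum_of_evens([1, 2, 3, 4]) == 6
```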

Here’s something else you’ll come across from time to time: The idea that “change” is either the only cause of incidents, or the leading cause, or some variation of that.

Has anybody ever heard of this? 

Now, I want to be open here. I don’t have enough money to buy the report from Gartner. But in 2015, Gartner said that 85% of performance incidents can be traced to changes.

I want to propose another take based on my experience. Let me know if I’m off base here:

Changes are also one of the leading causes of resolving incidents, yeah?

And is it fair to say that all prevented incidents are triggered by making changes?

So, sorry, I just want to make sure I understand this… Changes are bad…but also…changes are good?

Wait, what were we talking about?

Sometimes this fuzzy hard-to-pin-down conventional wisdom is even implicit in the way we talk about it. We talk about code freezes, but underlying that is an assumption. Sometimes we can revisit an assumption and it turns out to be true, which is great. That’s how science works – it’s called validation. Sometimes you revisit and question whether something is even real. Are you sure that’s a thing?

We need productive skepticism. In 2018, I wrote a blog post calling into question how we’re gathering all these numbers around incidents that don’t have anything to do with the content of the incident, but certainly have to do with the dimensions of the incident – like the frequency of incidents over time, the length of different portions of these incidents.

This blog post was really satisfying to write.

Since then, Courtney Nash has destroyed this topic.

If you haven’t seen or read any of her work on this, then you’re absolutely missing out. Courtney has expanded on this and has gone in incredibly interesting and discussion-provoking directions. You should put it on your to-do list.

I often see framings like this, where incidents follow particular steps in sequence. Sometimes they’re called steps, sometimes they’re called phases…

It’s quite common to think of them this way, and it might be productive in certain circumstances. But there are variations of this. This one, which is different in a visual way, clearly took a lot of time to put together.

…sometimes the whole thing is called a cycle, sometimes it’s called a “lifecycle.”

But when we look at our experience in this community with real incidents, it’s weird because they never feel as neat and orderly and crisp as they do on the slide. As I was making the slide, I was thinking this particular representation certainly looks like Excel. Like you’re only one formula away from getting your work done.

Is this how most incidents play out? Can we assume that enough incidents play out this way that we can ignore or dismiss those that don’t?

(Someone in the front row yelled, “What do you mean by an incident?” 😀 Yes, friend….yes.)

Yeah, we can make some assumptions, like “there is no friction” or “the cat is a cube,” and things can work, but we’re contorting reality.

And you may say to yourself (and you’d be right): “We all know this. We get it. It’s just a model. We are very smart. Everybody understands this.”

I know you all know this…but do they?

Because I do know this: I have experience, and some of you might have experience, with organizations that reward, punish, and make significant decisions based on numbers which critically depend on this model being concretely accurate.

So it’s not just “it doesn’t go like that.” It’s “hey, it doesn’t go like that because three steps away somebody is getting a report and they’re gonna make decisions based on that.”

I’m going to walk you through a real incident here — and ironically, I am oversimplifying. So, it’s a regular afternoon and somebody notices, “Huh, that’s weird.” Not shocking, but more like, “Steve, you going to lunch? I’m just gonna hang back and check this thing out.”

As time passes, things get a little weirder. You’re like, “Hold on a second. This is not good. This is definitely a thing.” You don’t know if it’s an incident or not, but it’s definitely a thing. So you spend some time figuring out what to do about it. You try to fix it. And then you spend more time to figure out whether or not that worked.

Here’s my question: where is “diagnose”?

This is an example, but I have no doubt that almost everybody in this room can tell me a story about an incident that looked like this. Cases where it wasn’t like, “You are now leaving the Diagnose Phase. What you have entered is Mitigation.”

I love Honeycomb for this write-up. If you detect something after it’s resolved, is the time-to-resolve negative? [In a whisper: Let’s not include it. They’ll never know.]
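Here’s a minimal sketch of why that’s not just a joke (my own example; the timestamps and field names are made up, not Honeycomb’s). The moment “time to resolve” is computed by simple timestamp subtraction, the linear phase model is baked into the arithmetic:

```python
from datetime import datetime

# Hypothetical incident record. The linear phase model assumes
# detection always precedes resolution, so time-to-resolve is
# just resolved_at - detected_at.
incident = {
    "started_at":  datetime(2024, 3, 1, 12, 0),
    "resolved_at": datetime(2024, 3, 1, 12, 40),  # it went away on its own
    "detected_at": datetime(2024, 3, 1, 13, 5),   # ...then someone noticed
}

time_to_resolve = incident["resolved_at"] - incident["detected_at"]
print(time_to_resolve)  # prints "-1 day, 23:35:00", i.e. negative 25 minutes
```

Whatever the downstream report does with that value, it isn’t describing what actually happened.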

One of the authors of ITIL is having his mind completely blown right now.


It’s not simply that we put stuff down and the model is wrong; models being wrong is reality. It’s that those models can kick off distant decisions that genuinely have consequences.

If we look at real-world, concrete cases of anomaly response that have been validated across different domains of work from a cognitive work perspective, we see a different pattern.

Here’s a representation of something that my colleague Dr. Woods and others developed from cases in nuclear power control rooms. Note: it’s not a linear sequence. The types of cognitive work are intertwined and interdependent.

It’s not a sequence. It’s not a cycle. It’s a dynamic phenomenon. But unfortunately, it does not lend itself well to populating a spreadsheet. Still, it is the reality that you probably all recognize. Even if you’ve never seen this diagram before, you could probably work out intuitively how it works.

Let’s talk about repeat incidents. I would like to plug Em Ruppe’s talk on repeat incidents at SREcon EMEA (last year, I think it was). Are there people whose organization explicitly talks about repeat incidents being significant or different or treated differently? Anybody?

We’ve talked with organizations where repeat incidents are very specifically counted and various reports are tabulated. I don’t work there and don’t know exactly what the impact of those reports is, formally or informally.

Look, we often discuss “repeat incidents,” and if it enables a discussion about a set of incidents then that’s great. It’s fine colloquially.

However, I want to make an assertion. The criteria for labeling an incident as a repeat matter more than the fact that there was a repeat. It’s almost an invitation. When someone says “Oh yeah, this is a repeat” or “this happened again,” the cognitive systems engineer in me just hears, “Ask me for more detail!” Was it literally at the exact same point in time, like, the same Unix epoch? No. Was it the same people who responded? No.

Here. I picked this case out of the VOID. Are you familiar with the VOID, which Courtney has carefully curated and expanded into an empire?

It’s a Second Life incident. I’ve never played or used Second Life, but I’m aware of it. This is a great case. “Instance started Thursday, blah blah blah, and it went on for a few hours,” and then this is what stuck out to me: “but it magically went away on its own.”

Next sentence: “The same thing happened again, but once again, it went away on its own.” Then it happened the next day.

My question is: are these three incidents? One?

They’re in the same write-up, so there’s an implicit relationship between them. But are they actually three? Because that could screw up your average. Or it could help; maybe you need it!
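Just to show the arithmetic (durations invented by me, not taken from the write-up), here’s a toy version of how much the labeling criteria alone can move the number:

```python
# Invented durations, in hours, for the three episodes in the write-up.
episodes = [3.0, 2.0, 2.5]

avg_if_three_incidents = sum(episodes) / len(episodes)  # 2.5 hours
avg_if_one_incident = sum(episodes) / 1                 # 7.5 hours

# Same events, same reality: a 3x swing in "average incident duration",
# decided entirely by how you chose to count.
print(avg_if_three_incidents, avg_if_one_incident)
```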

Okay. Incident response.

Again, these are just assertions. I’m less invested in some of these than my presenting them might suggest. The talk isn’t about any one of them. I’m not here to convince you that my reflections are valid. This is about the exercise of discussing and questioning.

I’m going to make an assertion here. An organization can be the most skilled and efficient at keeping stakeholders up to date about ongoing incidents and still be terrible about learning from them or responding to them.

They could be the best in the industry. Keeping stakeholders up to date is an important part of what surrounds incidents. But it’s different than incident handling.

You could say that “incident response” is a broader umbrella. Maybe you’d say “incident management.” Maybe you have some other term for it. But I’m talking about the activities that hands-on practitioners have skills and expertise to engage in. Think back to that anomaly response model that Woods and others developed.

Yes, there are also people who are responsible for resourcing and providing support for the people who are handling the incident. Leadership, business, relationships, customers, other parts of the organization. It’s not that they’re not important, but they don’t have the expertise to do the handling work.

You can get really good at a portion of incident response. It’s pretty multifaceted. I’m not saying you can’t get good at each of them, but I am saying that they’re different.

Here’s my proposal.

The capital-I Ideal when it comes to practitioners handling an incident: the people who respond to it are the exact people who can immediately recognize what is happening and the exact people who know what to do about it. And anything else that can support those two things is paramount.

It’s not that other things aren’t important. But they’re secondary. They may be necessary, but they’re secondary.

Here’s a thought exercise. If you had to choose between having six significant digits of precision on customer impact in real time, or having the incident fluidly handled — it’s handled quicker with less disruption and drag — which one do you think the company would choose?

The wild part about this is that when you achieve the ideal, here’s what happens. People show up and say, “Oh look, it’s blah. Okay. You do that. I’ll do this. Are we good here? We’re good? Okay, let’s go to lunch.” When that happens, it’s not even labeled as an incident because it was handled so incredibly fluidly. It wasn’t even difficult for them.

This is all about expertise. It’s worthwhile and productive to invest in anything that supports, expands, augments, amplifies, broadens, or diversifies expertise throughout the population of hands-on practitioners who respond to incidents.

And that can take all kinds of different shapes, many of which you already have. It shows up in code review. It shows up in what everybody who works at your company already knows: if I ask a bunch of people at your company to name five people who give amazing code reviews, you’re going to get a pretty strong overlap.

They’re known. Might not be written on the wiki, but everybody knows who they are. That’s what expertise is.

Anytime you have a situation where tenured veterans who’ve seen some shit can sit around and tell stories — especially in discussion with new hires — let’s just say your competitors are hoping you won’t do that.


And look at that. I’m done with the talk. 

Thanks for listening, everyone!


The Muppets’ Carol of the Bells


The only Christmas music I want to hear this year is The Muppets doing Carol of the Bells. Beaker, Animal, and the Swedish Chef make a great trio, don’t you think?

Tags: Christmas · holidays · music · The Muppets · video



What Are Your Personal Foundational Texts?


[Image: the book covers of Cars and Trucks and Things That Go, The Warmth of Other Suns, 1984, and The Death and Life of Great American Cities]

Writer Karen Attiah recently wrote about the pleasure of perusing other people’s personal libraries and then asked her followers what their “personal foundational texts” were…those books that people read over and over again during the course of their lives. Here was her answer:

Herge’s The Adventures of Tintin were foundational books for me — and probably why I’m in journalism today.

Otherwise:

Autobiography of Malcolm X
Audre Lorde’s “Sister Outsider”
Howard French: A Continent for the Taking

And lately: Anaïs Nin’s diaries

And I haven’t re-read them in a long time, but Barbara Ehrenreich’s “Nickel and Dimed” and Dambisa Moyo’s “Dead Aid” were paradigm shifting for me.

There are tons of good books mentioned in the replies and quote posts. One of the most faved answers features a book called They Thought They Were Free: The Germans, 1933–45, which I don’t think I’d ever heard of but sounds fascinating and unfortunately very relevant.

In thinking about the books I’ve read that made a significant impact on how I see and understand the world, I’d have to go with:

  • Various Richard Scarry books (like Cars and Trucks and Things That Go) when I was little, although Mister Rogers’ Neighborhood & Sesame Street probably had a bigger and more lasting impact on who I am as a person.
  • Where the Red Fern Grows was my favorite book as a child — I read it so many times. And there were these biography series for kids at my local library and I read a bunch of them. The two that I distinctly remember were the books on Thomas Edison and Harriet Tubman. From the Edison book I learned that a clever lad from the Midwest could make and invent wonderful things using his mind and his hands. And Harriet Tubman: she was straight-up a superhero and her story taught me all I needed to know about the truth of American slavery.
  • I first read Orwell’s 1984 in 1984, when I was 10 or 11. Probably affected my view of the world more than any other book.
  • As an adult, I’d say that A Natural History of the Senses, Nickel and Dimed, The Death and Life of Great American Cities, 1491, Chaos, A People’s History of the United States, and The Warmth of Other Suns have formed the backbone of my view of the world. There are probably a few others that I’m forgetting, but those are the biggies.

How about you? What are your personal foundational texts? Note that, as I understand it, these are not simply your favorite books, but the books that mean a lot to you and have been instrumental to your development as a human.

Tags: books · Karen Attiah



The Biggest Bomb in the World


The largest nuclear weapon ever tested was Tsar Bomba, a 50-megaton device detonated by the Soviet Union in 1961. That made it “3,300 times as powerful” as the bomb dropped on Hiroshima (which yielded roughly 15 kilotons; 50,000 ÷ 15 ≈ 3,300) — an almost unimaginable level of potential destructive power. But Tsar Bomba wasn’t even close to being the biggest nuclear weapon ever conceived. Meet Project Sundial, courtesy of Edward Teller, one of the inventors of the hydrogen bomb, and his colleagues at Livermore:

Only a few months later, in July 1954, Teller made it clear he thought 15 megatons was child’s play. At a secret meeting of the General Advisory Committee of the Atomic Energy Commission, Teller broached, as he put it, “the possibility of much bigger bangs.” At his Livermore laboratory, he reported, they were working on two new weapon designs, dubbed Gnomon and Sundial. Gnomon would be 1,000 megatons and would be used like a “primary” to set off Sundial, which would be 10,000 megatons.

10,000 megatons. In the video above, Kurzgesagt speculates that exploding a bomb of that size would result in a fireball “up to 50 kilometers in diameter, larger than the visible horizon”, a magnitude 9 earthquake, a noise that can be heard around the entire Earth, a 400 km radius in which everything is “instantly set on fire – every tree, house, person”, and, eventually, the deaths of most of the Earth’s population.

Sundial would bring about an apocalyptic nuclear winter, in which global temperatures would suddenly drop by 10°C, most water sources would be contaminated, and crops would fail everywhere. Most people in the world would die.

Fun fact: Edward Teller was one of Stanley Kubrick’s inspirations for the bomb-giddy character of Dr. Strangelove in the 1964 film of the same name.

Tags: atomic bomb · Edward Teller · Kurzgesagt · science · video



10 TV Shows Everyone Loves That Are Actually Bad

Popular does not always equal good.

My OnlyFans Was a Fun Way for Me to Make Money. Then My Content Got Stolen.

Leaks result in financial loss, jeopardize creators’ privacy and safety, and create an ongoing nightmare.
