
How to Prevent and Resolve Incidents Using Model Context Protocol (MCP) by Hannah Culver

1 Share

The rapid pace of modern software development, fueled by AI-driven coding and accelerated deployment cycles, has resurfaced a challenge that many development teams already struggled with: the speed of incident response must now match the speed of change. Every day, teams ship code faster than ever, which inevitably increases the risk of a new issue making it to production. The traditional approach—where engineers waste time jumping between disconnected tools—is no longer sustainable. It burns developers out and takes them out of a flow state. The solution is an interconnected AI ecosystem that leverages the operational data you already own.

PagerDuty is contributing to this interconnected AI ecosystem via Model Context Protocol (MCP), a standardized way for specialized AI tools to securely exchange information and actions. MCP acts as the common language, allowing AI tools and agents to talk directly to other tools and agents. 
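To make the "common language" idea concrete, here is a minimal sketch of what a tool invocation looks like on the wire. MCP messages are JSON-RPC 2.0, and a client invokes a server-side tool with the `tools/call` method. The tool name `list_incidents` and its arguments are hypothetical placeholders for illustration, not a confirmed part of the PagerDuty MCP server.

```python
import json
from itertools import count

# Simple incrementing request ids, as JSON-RPC requires a unique id per request.
_ids = count(1)


def make_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical tool name and arguments, used only to show the message shape.
message = make_tool_call(
    "list_incidents",
    {"service": "checkout-agent", "status": "triggered"},
)
print(message)
```

Because every tool is called through the same envelope, an AI client only needs the tool's name and argument schema to use it, regardless of which vendor runs the server.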

To date, we have over 60 tools that allow users to pull in critical incident data, service information, and even trigger automated responses in any AI-enabled tool of choice. And we’re always adding more tools (check out our release notes). We plan to build out MCP parity with our open APIs, meaning that all the critical PagerDuty data and actions available via API will be available via MCP. The best part? This data can be used during incidents, or when coding. Let’s look at how a flow could work for each of these scenarios.

Preventing Incidents with the Right Data at the Right Time

Imagine you’re creating a new agent, perhaps one that guides your users through a check-out experience, offering deals they should add to their cart before the final purchase. This agent could then share this information with other teams that are looking for demand signals on popular products. It’s critical that this agent both provides a good customer experience (relevant suggestions, works as intended) and relays the correct final purchase information back to internal teams. Now, imagine that you’re making a small tweak to this agent that should allow the user to rate the helpfulness of the agent’s suggestion. Let’s prevent a potential incident.

  1. Building Safer Agents with Past Incident Knowledge with LangSmith

The PagerDuty Incident Responder agent for LangSmith connects to the PagerDuty MCP server, accessing a service’s incident history and context. Developers can input a service name (such as the one this new agent is associated with), links to previous incidents involving this agent, or symptom descriptions from past failures. In response, PagerDuty will provide critical details that help developers assess risk: past incidents, triage information, and known failure modes discovered in post-incident reviews. This helps a developer prepare for a deploy with the right data at the right time.

  2. Scoring Code Risk Before Deployment with Claude Code

Developers who code in Claude Code can also score the risk of the uncommitted code changes right in their development workflow as another safety mechanism. The PagerDuty Plug-in for Claude Code is a risk scoring tool that brings production context directly into the development process. When a developer runs a simple command like /risk-score, Claude analyzes the new code against 90 days of PagerDuty incident data. The analysis identifies high-risk file types, the extent of the change, and whether it overlaps with areas that have caused past incidents. The developer then receives a clear risk score and actionable recommendations before the code is committed, helping to reduce the risk and cost of major operational failures.
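As a rough illustration of how those three signals (file type, extent of change, and overlap with past incidents) could combine into one number, here is a toy heuristic. This is not the plug-in's actual scoring method, which the post does not describe; the weights, the list of high-risk suffixes, the 500-line cap, and all names below are assumptions made up for this sketch.

```python
from dataclasses import dataclass

# Hypothetical set of file types treated as high risk (an assumption).
HIGH_RISK_SUFFIXES = (".sql", ".yaml", ".yml", ".tf")


@dataclass
class Change:
    path: str
    lines_changed: int


def risk_score(changes: list[Change], incident_paths: set[str]) -> int:
    """Return a 0-100 heuristic score for a set of uncommitted changes.

    incident_paths holds file paths implicated in past incidents,
    e.g. gathered from the last 90 days of incident data.
    """
    if not changes:
        return 0
    total_lines = sum(c.lines_changed for c in changes)
    risky_types = sum(c.path.endswith(HIGH_RISK_SUFFIXES) for c in changes)
    overlap = sum(c.path in incident_paths for c in changes)

    size_part = min(total_lines / 500, 1.0) * 30      # extent of the change
    type_part = risky_types / len(changes) * 30       # share of high-risk file types
    overlap_part = overlap / len(changes) * 40        # share touching incident-prone files
    return round(size_part + type_part + overlap_part)


changes = [Change("agent/ratings.py", 120), Change("deploy/config.yaml", 10)]
print(risk_score(changes, incident_paths={"deploy/config.yaml"}))  # prints 43
```

Overlap with past incidents carries the largest weight here on the assumption that history is the strongest predictor; a real tool would tune such weights against actual incident outcomes.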

  3. Checking System Health Before Deployment with GitHub Copilot

The PagerDuty Incident Responder custom agent for GitHub gives users access to PagerDuty data, including change correlation and incident data, directly within GitHub Copilot. Additionally, developers can build their own custom agents using PagerDuty MCP tools that offer even broader sets of data and actions. Users can quickly review what is currently happening in the system, ask about previous incidents on the service, and even summarize post-incident review notes. This can flag any concerns that may warrant postponing a deployment.

Accelerating Response During an Incident

The reality is that not every incident can be stopped, especially with the accelerated rate of shipping new code. When an incident does occur, MCP helps teams recover faster by reducing disruption and the cognitive load of having to jump between different tools. Let’s use our new agent example. Say the developer pushing the change to add the rating system skipped the review process above, and an issue slipped through the cracks. Here’s how MCP can make response smoother.

  1. Acknowledge and Review in Cursor

When a new alert fires, you can immediately acknowledge and review it without leaving your coding tool. The PagerDuty MCP Integration with Cursor allows Cursor to execute actions and pull in PagerDuty data, including who is currently on call, service status details, and incident history. This can help a developer answer key questions and begin triage by asking PagerDuty about incident impact and affected services, any pre-populated notes, and more. Without context switching, a user could also ask GitHub Copilot about recent changes, bringing that information in line with the critical PagerDuty data without ever leaving their tool of choice.

  2. Automated Diagnostics and Suggested Fixes with Honeycomb data

While a developer is reviewing the issue, the PagerDuty SRE Agent is running diagnostics in the background. PagerDuty will be extending its SRE Agent to use logging and metrics data from Honeycomb via MCP. The SRE Agent will use this critical telemetry to inform triage, quickly determine the root cause, and execute more pointed automation, taking the initial diagnostic burden off the human responder. For example, the agent can quickly suggest a fix, like rolling back a recent change.

  3. Quick Fix and Resolution

Thanks to this seamless flow of information, the responder can then go back to Cursor to take the suggested action—like rolling back the change. This unified, intelligent workflow quickly closes the loop from alert detection to resolution without pushing a user to a different surface. Response is faster, and developers can get back to building with less time spent on interrupt work.

By connecting data and actions from tools like LangSmith, Claude, GitHub Copilot, Cursor, Honeycomb, and more, PagerDuty is making the right data and actions accessible exactly where teams need them. This approach helps reduce friction, accelerate incident management to match the pace of AI-driven development, and ultimately gives developers more time back for higher-value work. We are only scratching the surface of what is possible with MCP.

Want to learn more about PagerDuty’s approach to MCP? Join our Twitch stream here.

Want to contribute to our repo? Check out our GitHub repo.

The post How to Prevent and Resolve Incidents Using Model Context Protocol (MCP) appeared first on PagerDuty.

Read the whole story
huskerboy
1 day ago
reply
Seattle
Share this story
Delete

A printable zine: 50 Ways To Meet Your Neighbor ....


A printable zine: 50 Ways To Meet Your Neighbor. “32. Picking up trash, generally, is a good way to meet neighbors. People notice. 33. Winter: Shovel someone’s sidewalk. It’s also great cardio.”


TIL about burping your house, aka lüften (in Germany),...


TIL about burping your house, aka lüften (in Germany), aka opening up the windows in your house daily to air it out, even in winter.


Army suspends 2 helicopter crews that flew near Kid Rock's house in Nashville

The crews of two AH-64 Apache helicopters that hovered next to Kid Rock's swimming pool while he clapped and saluted on Saturday have been suspended from flying pending an investigation of their actions, a U.S. Army spokesperson said on Tuesday.


The Gentle Romance / Career Dreamer / Skull


Rewarding sci-fi book

This collection of AI-related science-fiction short stories by Richard Ngo reminds me of the classic anthologies I read growing up during the golden age of science fiction. They are hard sci-fi, with technically plausible scenarios, played out many levels deep in very consistent worlds, explored by a very fertile imagination. I found more insights per page in Ngo’s The Gentle Romance than in any other book I’ve read for a long while. — KK

Career Dreamer career map

Google’s Career Dreamer tool has been around for a bit, but it’s recently been updated with more AI support and feels worth returning to if you’re in a career‑questioning season. It asks for your past roles, skills, and interests, and then reflects back possible career paths, related titles, and a “career identity statement” you can lift language from for your resume or LinkedIn. I like using it as a way to see how my existing experience could stretch into adjacent roles I hadn’t named yet. If you land on a path that involves freelancing or consulting, this hourly rate calculator is a good tool for discovering what people in similar roles are actually charging. — CD

Fun bluffing game

My daughter introduced me to Skull, a fun, fast-paced tabletop bluffing game for 3-6 people. Each player gets three rose cards and one skull card. Players take turns laying cards face down until one player announces they can turn over a specified number of flower cards from their own and the other players’ cards. Bidding continues until the others pass. If the high bidder turns over a skull, they lose the round; otherwise, they win. It takes about two minutes to learn, but the bluffing gets deviously deep. The coaster-like cardboard pieces feel great in your hands, and the artwork is beautiful. — MF

Other Mona Lisas

I like this fun list of what different places call “our Mona Lisa.” It’s not just museums or galleries. It includes single objects treated like sacred centerpieces by retail brands, jewelers, and more. I love the idea that any household can have its own Mona Lisa—something everything else seems to orbit around. — CD

Phone ring hack

Like many people, I keep my phone ringer on vibrate, but I don’t usually carry my phone on me – I may leave it on a desk – so I often miss calls. I’ve greatly reduced missed calls by setting the phone to flash its flashlight and its screen while it vibrates. That flashing light is enough to notice from a distance. It is easy to set up on the iPhone. Go to Settings > Accessibility > Audio Visual > Flash for Alerts. For Android: Settings > Accessibility > Audio & Screen Text > Flash Notifications. — KK

Free encyclopedia of ancient design patterns

In 1930, pioneering archaeologist Sir Flinders Petrie published Decorative Patterns of the Ancient World, cataloging over 3,000 ornamental motifs — spirals, animals, rosettes, braids, crosses, and more — drawn from ancient civilizations across Europe and the Near East up to about 1000 AD. The entire book is free to browse and download on the Internet Archive, making it an incredible reference for artists, designers, crafters, and anyone looking for authentic, copyright‑free historical patterns to use in their work. The simple black‑and‑white line drawings make the motifs easy to trace, digitize, or adapt. Used copies of an out-of-print Dover paperback are also available. — MF


Sign up here to get Recomendo a week early in your inbox.


SRE Weekly Issue #510


A message from our sponsor, ClickHouse:

AI isn’t replacing SREs. It’s changing how they work.

The near future of observability isn’t autonomous agents, it’s collaboration. ClickHouse’s ClickStack Notebooks bring SREs and AI into a shared investigative workspace, combining human intuition with structured, reliable tooling to debug faster and think more clearly.

Read more

ML systems decay gradually instead of breaking suddenly, so we need error budgets for model accuracy, data freshness, and fairness — not just uptime.

   Varun Kumar Reddy Gajjala — DZone

Enterprises rarely fail because they don’t care about reliability.
They fail because:

  • failure is loud,
  • prevention is quiet,
  • and budgeting systems are wired to respond to noise.

  Florian Hoeppner

They had hundreds of databases to migrate, so they built a tested, self-service migration workflow.

  Ram Srivasta Kannan, Wale Akintayo, Jay Bharadwaj, John Crimmins, Shengwei Wang, and Zhitao Zhu — Netflix

I love the technical description of socket juggling to achieve a graceful restart. I could swear that this technique has been around for decades though, for example in TinyMUX et al…

  Manuel Olguín Muñoz — Cloudflare

Lorin goes into what an AI incident manager might look like, since no tools of the sort exist yet.

  Lorin Hochstein

By default, Kubernetes keeps a pretty short event history. This article argues that what we really need is the ability to know the state of the system at a specific time.

   Shamsher Khan — DZone

They built a platform for safely rolling out configuration changes. I like that it has a special mode for use in incident response.

  Cosmo W. Q — Airbnb

This is a cool debugging story, and I love the emphasis on mental models. The bit about simulating different paths through the software is quite intriguing.

  Michael Victor Zink — Readyset (via Antithesis)
