LLM-Human Interaction Design Patterns for Operations

Designing the Seam Between AI Agents and Human Operators


April 2026

By Robert Barcik

LearningDoe s.r.o.


About This Guide

This guide addresses the most consequential and least discussed design decision in operational AI: the interaction boundary between AI agents and human operators. The question is no longer whether to deploy large language models in operations – that ship has sailed, with organizations like Splunk reducing security investigation times from 90 minutes to 60 seconds per alert, and Dynatrace reporting 56% faster mean time to resolution through autonomous remediation. The question is how to design the handoff so that the human-AI team outperforms either component alone.

The answer, it turns out, is neither obvious nor purely technical. Decades of research in aviation, healthcare, cybersecurity, and industrial control have produced a rich body of evidence on what happens when humans interact with automated systems – and much of it is cautionary. Pilots who cannot hand-fly when the autopilot disconnects. Nurses who override 90% of medication alerts. Security analysts who leave 63% of daily alerts unaddressed. Pipeline operators who dismiss SCADA alarms for 17 hours while 3.3 million liters of crude oil leak into a river. These are not failures of automation. They are failures of interaction design.

This guide synthesizes that evidence into actionable design patterns for engineers building AI-assisted operational systems. It draws on the Sheridan-Verplank automation taxonomy, the Parasuraman-Sheridan-Wickens four-stage model, Endsley’s Situation Awareness framework, Klein’s Recognition-Primed Decision model, and production deployments at GitHub, PagerDuty, Splunk, Dynatrace, and ServiceNow. Each pattern is grounded in specific numbers, specific case studies, and specific design decisions that you can apply to your own systems.

Who This Guide Is For

  • GenAI engineers building operational AI systems with tool-use capabilities (MCP, ADK, function calling) who need to design the interaction layer between their agents and human operators
  • IT operations managers introducing AI agents into incident response, monitoring, or service desk workflows and seeking evidence-based guidance on autonomy levels
  • Product managers designing AI-assisted workflows who must balance automation efficiency against human oversight and accountability
  • Security operations professionals deploying AI triage and investigation tools in SOC environments where alert fatigue and missed detections carry real consequences
  • Anyone deploying AI that makes recommendations to humans in contexts where the cost of a wrong decision is measured in dollars, downtime, or safety

How to Read This Guide

Chapters 1 and 2 establish the structural foundation: what the design seam is, why it matters, and the five core interaction patterns that govern how AI agents hand off to human operators. Chapters 3 through 5 address the human side of the equation – the cognitive biases, communication frameworks, and trust dynamics that determine whether a well-designed system actually works in practice. Chapter 6 bridges theory to practice with prompt templates, architecture patterns, and a self-assessment worksheet. Chapters 7 and 8 cover failure modes and organizational governance. Chapter 9 synthesizes the preceding material into a decision framework.

You can read the guide sequentially or jump to the chapter most relevant to your current design challenge. Each chapter is self-contained, with cross-references where concepts build on earlier material.


Table of Contents

  1. The Design Seam
  2. Five Structural Patterns
  3. The Psychology of Handoff
  4. Context Presentation
  5. Trust Calibration
  6. Implementing the Patterns
  7. Designing for Failure
  8. Organizational Governance
  9. Conclusion

Chapter 1: The Design Seam

Every AI agent that interacts with a human operator creates a seam – a boundary where machine cognition hands off to human judgment. This seam is not a bug to be eliminated or a formality to be minimized. It is the single most consequential design decision in any AI-assisted operational system, and getting it wrong has, in documented cases, cost billions of dollars and hundreds of lives.

Why This Matters Now

For most of the history of large language models, the interaction pattern was straightforward: a human typed a prompt, and the model returned text. The human was always in the loop because the human was the loop. The model could not act on the world – it could only suggest.

That constraint has dissolved. The Model Context Protocol (MCP) gives LLMs structured access to external tools and data sources. The Agent Development Kit (ADK) provides frameworks for building autonomous agents that can plan, execute, and iterate. Function calling enables LLMs to invoke APIs, modify databases, restart services, and deploy code. What was once a text-completion engine is now an autonomous actor capable of taking consequential actions in production environments.

This shift – from LLMs-as-tools to LLMs-as-agents – changes the design problem fundamentally. When an LLM can only recommend, a poor recommendation costs nothing until a human acts on it. When an LLM can execute, a poor decision costs everything the moment it is made. The seam between human and machine is no longer a UX nicety. It is a control surface.
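To make the control surface concrete, here is a minimal, framework-agnostic sketch of tool exposure with an approval requirement enforced at the boundary. The registry, the `requires_approval` flag, and the tool names are illustrative assumptions, not the API of MCP, ADK, or any specific function-calling SDK.

```python
# Illustrative sketch only: this registry and the requires_approval
# flag are assumptions, not the API of any particular agent framework.
from typing import Callable

TOOLS: dict = {}

def register_tool(name: str, requires_approval: bool):
    """Register a callable the agent may invoke, recording whether a
    human must approve the call before it runs."""
    def decorator(fn: Callable) -> Callable:
        TOOLS[name] = {"fn": fn, "requires_approval": requires_approval}
        return fn
    return decorator

@register_tool("restart_service", requires_approval=True)
def restart_service(service: str) -> str:
    # Stand-in for a real infrastructure call.
    return f"restarted {service}"

def invoke(name: str, approved: bool = False, **kwargs) -> str:
    """Route every agent tool call through the registry, so the
    approval requirement is enforced at the seam, not by the model."""
    tool = TOOLS[name]
    if tool["requires_approval"] and not approved:
        return "PENDING_APPROVAL"
    return tool["fn"](**kwargs)
```

The design point is that the approval gate lives in deterministic code at the boundary, where the model cannot talk its way past it.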

The numbers confirm the urgency. GitHub Copilot now handles 1 in 5 code reviews, with over 60 million reviews processed across more than 12,000 organizations. PagerDuty’s SRE Agent autonomously triages and remediates production incidents. Splunk’s Agentic SOC investigates security alerts with minimal human involvement. ServiceNow deploys over 300 AI Skills across 30+ modules for IT service management. These are not prototypes. These are production systems making decisions that affect uptime, security, and revenue at scale.

And yet the interaction design – the seam – often receives less attention than the model architecture, the prompt engineering, or the tool integration. This is a mistake with well-documented precedents.

The Fundamental Tension

The core challenge of human-AI interaction in operations is a tension that cannot be resolved, only managed: too much autonomy removes the human oversight that catches errors, while too much oversight defeats the purpose of automation and introduces its own failure modes.

This tension is not new. In 1983, Lisanne Bainbridge published “The Ironies of Automation,” a paper that has proven almost prophetically relevant to the age of AI agents. Bainbridge identified a paradox that sits at the heart of every automation design decision:

The more reliable an automated system becomes, the less frequently humans need to intervene. The less frequently humans intervene, the less practice they get. The less practice they get, the less capable they are of intervening effectively when the automation fails. And the more reliable the system, the more complacent the human becomes, the less they monitor, and the less likely they are to detect a failure in time to act.

What Happens When the Seam Fails

Two cases from aviation illustrate the two fundamental failure modes – and both map directly to AI agent design.

Air France Flight 447 (2009) demonstrated handoff execution failure. When the autopilot disconnected over the Atlantic due to unreliable airspeed data, the pilots – who had spent the vast majority of their flight hours monitoring automation – were suddenly required to hand-fly the aircraft in degraded conditions. Their manual flying skills and instrument interpretation abilities had atrophied through disuse. The pilots never diagnosed the aerodynamic stall. All 228 aboard died. The investigation found that automation reliability had eroded the very skills needed when automation failed.

Boeing 737 MAX (2018-2019) demonstrated handoff design failure. The MCAS system relied on a single angle-of-attack sensor, was not mentioned in pilot training materials, and when it activated erroneously, the override procedure was neither obvious nor well-practiced. Pilots fought the automation but could not effectively override it. Three hundred and forty-six people died across two crashes because the seam was designed in a way that made effective human intervention nearly impossible.

Key distinction: AF447 was a failure of the human at the seam – the automation worked correctly by disconnecting, but the humans could not perform. Boeing 737 MAX was a failure of the seam itself – the automation prevented effective human oversight. Both failure modes are directly relevant to AI agent design: your operators may lack the skills to override your agent (AF447), or your agent may be designed in a way that makes override impractical (737 MAX).

Situation Awareness at the Seam

Mica Endsley’s Situation Awareness model (1995) explains why these failures are predictable. SA operates at three levels: perception (seeing the data), comprehension (understanding what it means), and projection (anticipating what happens next). Automation’s most insidious effect is on comprehension – operators can see the outputs but lose the contextual understanding that makes those outputs meaningful.

This is directly relevant to AI agents. An LLM agent that autonomously investigates an incident and presents a summary is asking the operator to exercise projection and decision-making without having gone through perception and comprehension. The operator must decide based on a summary they did not construct, using context they did not gather, about a system state they did not observe. Without deliberate design support, the operator defaults to either rubber-stamping (automation bias) or second-guessing everything (automation distrust).

Defining the Design Seam

The design seam is the complete set of decisions that govern how an AI agent and a human operator interact at their boundary:

  • What the agent does autonomously versus what it refers to the human
  • How the agent communicates its findings, recommendations, and confidence levels
  • What information the human receives to evaluate the agent’s output
  • How much time the human has to make a decision
  • What controls the human has to override, modify, or roll back the agent’s actions
  • How the system degrades when the agent fails, the human errs, or communication breaks down

Each of these decisions shapes the interaction in ways that compound over time. A system that presents recommendations without confidence levels trains operators to trust or distrust uniformly. A system that allows autonomous action without rollback mechanisms creates irreversible consequences from reversible errors. A system that presents too much information per decision creates the cognitive overload that leads to alert fatigue and rubber-stamping.

Key insight: The goal is not to eliminate the seam. It is to design it so that the human-AI team outperforms either component alone. This requires treating the seam not as a technical interface but as a sociotechnical system where human cognition, organizational context, and system architecture interact.
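These seam decisions can be made explicit and reviewable as configuration rather than left implicit in prompts and code paths. The following sketch is one possible encoding; every field name is an assumption chosen for illustration, not a standard schema.

```python
# Sketch of the seam as explicit configuration; all field names are
# illustrative assumptions, not a standard schema.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SeamPolicy:
    autonomous_actions: frozenset      # what the agent may do without asking
    show_confidence: bool              # surface confidence with every finding
    attach_evidence: bool              # raw evidence, not just a summary
    veto_window_s: Optional[int]       # time to veto (None = explicit approval)
    rollback_window_s: int             # how long actions remain reversible
    on_agent_failure: str              # degradation path when the agent fails

policy = SeamPolicy(
    autonomous_actions=frozenset({"clear_cache", "restart_container"}),
    show_confidence=True,
    attach_evidence=True,
    veto_window_s=None,
    rollback_window_s=3600,
    on_agent_failure="escalate_to_human",
)
```

Making the policy a frozen object means changes to the seam go through review, the same way infrastructure changes do.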

The next chapter introduces the five structural patterns that define how AI agents and human operators divide responsibility in operational workflows.


Chapter 2: Five Structural Patterns

Not all AI-human handoffs are alike. The appropriate pattern depends on the risk of the action, the time available, and the expertise of the operator. A security analyst triaging thousands of alerts per day needs a fundamentally different interaction pattern than a site reliability engineer approving a database failover. An IT service desk agent resolving password resets operates under different constraints than a compliance officer reviewing AI-generated audit findings.

This chapter defines five structural patterns that cover the full spectrum of human-AI interaction in operations, maps them to established automation taxonomies, and provides a decision framework for selecting the right pattern for a given operational context.

The Sheridan-Verplank Foundation

Before examining the patterns, it is worth grounding them in the taxonomy that has structured automation research for nearly five decades. In 1978, Thomas Sheridan and William Verplank proposed a 10-level scale of automation, ranging from full human control to full machine autonomy. Their framework remains the most widely cited reference point for automation design, and every modern framework – including those from PagerDuty, the Cloud Security Alliance, and NIST – can be mapped back to it.

Level | Description | Operational Example
1 | The computer offers no assistance; the human does everything | Manual log analysis with grep and text editors
2 | The computer offers a complete set of action alternatives | AI lists all possible root causes for an alert
3 | The computer narrows the selection down to a few alternatives | AI identifies the 3 most likely root causes with supporting evidence
4 | The computer suggests one alternative | AI recommends a specific remediation action
5 | The computer suggests one alternative and executes it if the human approves | AI recommends rolling back a deployment and pre-stages the rollback command
6 | The computer allows the human a restricted time to veto before automatic execution | AI will auto-scale infrastructure in 60 seconds unless the operator cancels
7 | The computer executes automatically, then informs the human | AI auto-remediates a known issue and posts a summary to the incident channel
8 | The computer executes automatically and informs the human only if asked | AI silently handles routine certificate renewals; status available in dashboard
9 | The computer executes automatically and informs the human only if it decides to | AI resolves issues autonomously and only alerts humans for novel failure modes
10 | The computer decides everything, acts autonomously, ignores the human | Fully autonomous system with no human interface (rarely appropriate in operations)

The five patterns described below map to clusters within this scale, but they are defined by operational characteristics rather than by abstract automation levels. They answer the practitioner’s question: “How should my AI agent interact with my human operators for this specific type of work?”

Pattern 1: Recommend and Wait

Sheridan-Verplank Levels 4-5 | The AI recommends; the human decides and acts.

In this pattern, the AI agent analyzes the situation, gathers evidence, and presents a single recommended action to the human operator. The agent then waits. No action is taken until the human explicitly approves, modifies, or rejects the recommendation.

This is the safest pattern and the appropriate default for any action where the consequences of an error are significant and irreversible.

PagerDuty SRE Agent

PagerDuty’s SRE Agent exemplifies this pattern in production incident response. When an alert fires, the agent automatically gathers context: it pulls recent deployment history, queries monitoring dashboards, checks for correlated alerts across services, and examines relevant runbooks. It then presents the on-call engineer with a synthesized assessment and a recommended action – for example, “Roll back deployment v2.4.7 to v2.4.6. Evidence: error rate increased 340% within 8 minutes of deployment, correlated with this commit changing the database connection pooling configuration.”

The engineer reviews the recommendation, examines the evidence, and either approves the rollback or investigates further. The agent does not execute the rollback autonomously. This is deliberate: production rollbacks can have cascading effects, and the engineer’s contextual knowledge – awareness of an ongoing data migration, knowledge that v2.4.6 had its own issues, recognition that the error rate spike might be a measurement artifact – is essential to the decision.
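The approval gate described above can be sketched in a few lines. All names here are hypothetical; the point of the structure is that the execute step is unreachable without an explicit human decision.

```python
# Recommend-and-Wait gate, sketched with hypothetical names. The
# execute callable cannot run without an explicit human decision.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Recommendation:
    action: str
    evidence: List[str]
    confidence: float

def recommend_and_wait(rec: Recommendation,
                       ask_human: Callable[[Recommendation], str],
                       execute: Callable[[str], object]) -> str:
    decision = ask_human(rec)              # "approve" | "reject" | "modify:<action>"
    if decision == "approve":
        execute(rec.action)
        return "executed"
    if decision.startswith("modify:"):
        execute(decision.split(":", 1)[1]) # human substitutes their own action
        return "executed_modified"
    return "rejected"                      # nothing ran; the agent only waited
```

Note that the human can modify as well as approve or reject: the engineer's contextual knowledge enters the loop as a first-class path, not an afterthought.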

Johns Hopkins Sepsis AI

In healthcare, Johns Hopkins deployed an AI system for early sepsis detection that operates squarely in the Recommend and Wait pattern. The system continuously monitors patient vitals and laboratory results, using machine learning to identify the subtle early indicators of sepsis that human clinicians frequently miss. When the system detects a high-probability case, it alerts the clinical team with a recommended treatment protocol.

The results are striking: an 82% catch rate for sepsis cases, with a 20% reduction in mortality among patients flagged by the system. The system does not administer treatment. It does not order labs. It recommends, and the clinical team – with their knowledge of the patient’s history, comorbidities, and current treatment plan – decides.

Key insight: Recommend and Wait is not a conservative fallback. It is a high-performance pattern when the AI’s analysis is genuinely valuable but the human’s contextual knowledge is essential to the final decision. The 20% mortality reduction at Johns Hopkins was achieved entirely through better recommendations, not through autonomous action.

When to Use This Pattern

  • The action is irreversible or expensive to reverse (production deployments, security blocks, patient treatments)
  • The human operator has domain expertise that the AI cannot fully capture (organizational context, recent conversations, political considerations)
  • Regulatory or compliance requirements mandate human approval
  • The AI system is newly deployed and trust has not yet been established

Pattern 2: Triage and Escalate

Sheridan-Verplank Levels 3-5 | The AI filters, prioritizes, and routes; the human handles what remains.

In this pattern, the AI agent processes a high-volume stream of inputs – alerts, tickets, requests – and performs initial triage. It classifies items by severity and type, filters out noise, enriches items with relevant context, and routes them to the appropriate human operator or team. The human works from a curated, prioritized queue rather than a raw feed.

This pattern is most valuable in environments where the volume of inputs overwhelms human processing capacity.

Splunk Agentic SOC

The scale of the problem in security operations is staggering. Research consistently shows that the average SOC processes 2,992 security alerts per day, of which 63% go entirely unaddressed. Analysts spend an average of 90 minutes investigating each alert that they do examine. The arithmetic is brutal: even with a full team, the majority of alerts receive no human attention at all.

Splunk’s Agentic SOC addresses this by deploying AI agents that perform the initial investigation autonomously. When an alert fires, the agent queries relevant data sources (SIEM logs, endpoint telemetry, threat intelligence feeds), correlates the alert with known attack patterns, checks for false positive indicators, and produces a structured investigation summary. What previously took an analyst 90 minutes is completed in 60 seconds.

The agent does not decide whether the alert represents a real threat. It presents the analyst with a structured brief – including the alert details, correlated evidence, historical context, and a preliminary assessment – and the analyst makes the determination. But critically, the agent also assigns a priority score, ensuring that the most likely genuine threats surface first. Analysts work from the top of a prioritized queue rather than from a chronological feed.
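The triage step (score each alert, filter obvious noise, rank the rest) can be sketched as follows. The scoring formula and threshold are illustrative assumptions, not Splunk's actual logic.

```python
# Triage sketch: score, filter noise, rank. The scoring formula and
# threshold are illustrative assumptions, not any vendor's logic.
import heapq

def triage(alerts: list, noise_threshold: float = 0.2) -> list:
    queue = []
    for i, alert in enumerate(alerts):
        # Favor severe alerts that do not look like false positives.
        score = alert["severity"] * (1 - alert["false_positive_likelihood"])
        if score < noise_threshold:
            continue                       # noise: never reaches the analyst
        heapq.heappush(queue, (-score, i, alert))
    # Analysts read from the top of this queue, not a chronological feed.
    return [heapq.heappop(queue)[2] for _ in range(len(queue))]

alerts = [
    {"id": "a1", "severity": 0.9, "false_positive_likelihood": 0.1},
    {"id": "a2", "severity": 0.3, "false_positive_likelihood": 0.8},
    {"id": "a3", "severity": 0.7, "false_positive_likelihood": 0.2},
]
ranked = triage(alerts)
```

In this toy run, the low-severity, likely-false-positive alert is filtered out entirely and the remainder arrive ranked by priority.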

ServiceNow AI Agents

ServiceNow has taken the Triage and Escalate pattern to enterprise scale with its Now Assist platform, deploying over 300 AI Skills across more than 30 modules. In IT service management, AI agents automatically classify incoming tickets, extract key information, identify relevant knowledge base articles, and route tickets to the appropriate resolution group.

For straightforward requests – password resets, access provisioning, standard software installations – the agent may resolve the ticket autonomously (shifting into an Execute and Report pattern). For complex or ambiguous issues, it enriches the ticket with diagnostic information and escalates to a human agent who receives a pre-investigated case rather than a raw complaint.

When to Use This Pattern

  • Input volume exceeds human processing capacity (thousands of alerts or tickets per day)
  • The majority of inputs are routine, false positive, or low-priority
  • The cost of delayed response to high-priority items is significant
  • Human expertise is the bottleneck and must be focused on the highest-value work

Pattern 3: Execute and Report

Sheridan-Verplank Levels 7-8 | The AI acts autonomously and informs the human afterward.

In this pattern, the AI agent takes action without waiting for human approval, then reports what it did. The human reviews the action after the fact and intervenes only if something went wrong. This pattern is appropriate only when three conditions are met: the action is well-understood, the action is reversible, and the cost of delay exceeds the cost of occasional errors.

Dynatrace Davis AI

Dynatrace’s Davis AI engine operates at the Execute and Report level for a defined set of remediation actions. When Davis detects a performance anomaly – say, a memory leak causing response time degradation in a microservice – it can automatically trigger a remediation action, such as disabling a problematic feature flag, scaling up a resource, or restarting a container.

The results are quantifiable: organizations using Davis AI’s autonomous remediation report 56% faster mean time to resolution compared to human-only workflows. The system executes the remediation, logs the action with full context (what was detected, what action was taken, what the expected and actual outcomes were), and notifies the operations team.

Critically, Davis AI does not auto-remediate everything. The system maintains an explicit list of approved autonomous actions, each with defined rollback procedures. Actions outside this list are escalated to the Recommend and Wait pattern. This bounded autonomy – executing autonomously within defined guardrails, escalating outside them – is what makes the pattern safe at Sheridan Level 7 rather than reckless at Level 10.
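Bounded autonomy of this kind reduces to an allow-list check at the seam. A minimal sketch follows, with hypothetical action names; logging and notification are elided.

```python
# Bounded Execute-and-Report as an allow-list check. Action names and
# rollbacks are hypothetical; logging and notification are omitted.
APPROVED_AUTONOMOUS = {
    "restart_container": "none_needed",      # restart is self-limiting
    "disable_feature_flag": "re_enable_flag",
    "scale_up": "scale_down",
}

def handle(action: str) -> str:
    if action not in APPROVED_AUTONOMOUS:
        # Outside the guardrails: fall back to Recommend and Wait.
        return f"ESCALATE: '{action}' requires human approval"
    rollback = APPROVED_AUTONOMOUS[action]
    # Execute, log full context, notify the team (elided in this sketch).
    return f"EXECUTED: {action} (rollback available: {rollback})"
```

Pairing every approved action with a named rollback procedure is what keeps the pattern at Sheridan Level 7 rather than drifting toward Level 10.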

When to Use This Pattern

  • The action is well-understood and has been successfully executed many times before
  • The action is reversible within an acceptable time window
  • The cost of delay (human approval latency) exceeds the expected cost of occasional errors
  • Comprehensive logging and rollback mechanisms are in place
  • The scope of autonomous action is explicitly bounded and regularly reviewed

Key distinction: Execute and Report is not “set and forget.” It requires more engineering investment than Recommend and Wait – not less – because the system must include monitoring of its own actions, automated rollback capabilities, and clear escalation paths for when autonomous remediation fails or produces unexpected results.

Pattern 4: Draft and Refine

Sheridan-Verplank Level 5 (adapted) | The AI produces a complete artifact; the human reviews, edits, and approves.

This pattern differs from Recommend and Wait in a subtle but important way. Rather than recommending an action, the AI produces a complete work product – a code review, an incident report, a runbook update, a configuration change – that the human then refines. The human’s role shifts from decision-maker to editor.

GitHub Copilot Code Review

GitHub Copilot’s code review capability provides the most scaled example of this pattern in production. As of early 2026, Copilot handles 1 in 5 code reviews on the platform, with more than 60 million reviews processed across over 12,000 organizations.

The interaction pattern is instructive. When a pull request is submitted, Copilot analyzes the changes, identifies potential issues (bugs, security vulnerabilities, style violations, performance concerns), and generates review comments with specific suggestions. The developer – or the pull request author – reviews these comments, accepts the ones that are valid, dismisses the ones that are not, and may engage in a back-and-forth with Copilot to refine specific suggestions.

WEX, a financial technology company, reported that teams using Copilot code review shipped 30% more code – not because the AI wrote more code, but because the review cycle was faster and more consistent. The AI handled the routine checks (style, common bug patterns, documentation gaps), freeing human reviewers to focus on architectural decisions, business logic correctness, and edge cases that require domain expertise.

When to Use This Pattern

  • The output is a complex artifact (code, documentation, configuration) rather than a binary decision
  • Quality depends on iterative refinement rather than a single correct answer
  • The human’s expertise is in evaluation and editing rather than generation from scratch
  • The volume of artifacts exceeds what humans can produce from scratch but not what they can review

Pattern 5: Graduated Autonomy

Dynamic across Sheridan-Verplank levels | The AI’s autonomy level adjusts based on context, confidence, and track record.

This is the meta-pattern: rather than fixing a single interaction pattern, the system dynamically adjusts the level of autonomy based on the specific situation. An AI agent might operate at Execute and Report for routine, well-understood issues, shift to Recommend and Wait for novel or high-risk situations, and escalate to full human control when it encounters something outside its training distribution.

PagerDuty Three-Tier Framework

PagerDuty’s SRE Agent implements graduated autonomy through a three-tier framework that maps directly to operational risk:

Tier | Autonomy Level | Example Actions | Human Role
Tier 1: Automated Response | Execute and Report | Restarting a crashed container, scaling a resource, clearing a cache | Post-hoc review
Tier 2: Guided Response | Recommend and Wait | Rolling back a deployment, failing over a database, modifying network rules | Approval required
Tier 3: Collaborative Investigation | Triage and Escalate | Novel failure modes, multi-system cascades, potential security incidents | Active investigation with AI assistance

The tier assignment is not static. An action that begins as Tier 2 (Recommend and Wait) may be promoted to Tier 1 (Execute and Report) after it has been successfully recommended and approved 50 times without incident. Conversely, an action that is normally Tier 1 may be temporarily downshifted to Tier 2 during a change freeze, after a major incident, or when the AI’s confidence score falls below a threshold.
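Promotion and demotion logic of this kind can be sketched as a small state machine. The 50-approval threshold echoes the example above; everything else is an illustrative assumption, not PagerDuty's implementation.

```python
# Tier promotion/demotion sketch. The 50-approval threshold echoes the
# text; the rest is an illustrative assumption.
class ActionTier:
    PROMOTION_THRESHOLD = 50   # clean approvals before Tier 2 -> Tier 1

    def __init__(self) -> None:
        self.tier = 2          # start at Guided Response
        self.clean_approvals = 0

    def record_outcome(self, approved: bool, caused_incident: bool) -> None:
        if caused_incident:
            self.tier = 2                  # any incident demotes immediately
            self.clean_approvals = 0
        elif approved:
            self.clean_approvals += 1
            if self.clean_approvals >= self.PROMOTION_THRESHOLD:
                self.tier = 1              # promote to Automated Response

    def change_freeze(self) -> None:
        self.tier = max(self.tier, 2)      # suspend autonomy during freezes
```

The asymmetry is deliberate: promotion requires a long clean track record, while demotion happens on a single incident.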

CSA Autonomy Levels

The Cloud Security Alliance published its AI Autonomy Levels framework in January 2026, defining six levels specifically for AI agents in security operations:

CSA Level | Name | Description | Key Characteristic
0 | No AI | Fully manual operations | Baseline
1 | Assistive AI | AI provides information; human decides and acts | Copilot mode
2 | Supervised Autonomy | AI recommends actions; human approves | Recommend and Wait
3 | Conditional Autonomy | AI acts within defined boundaries; human handles exceptions | Bounded Execute and Report
4 | High Autonomy | AI acts independently for most tasks; human oversees | Execute and Report with monitoring
5 | Full Autonomy | AI operates independently with minimal human involvement | Rarely appropriate for security

The most significant contribution of the CSA framework is its concept of dynamic downshifting: the principle that an AI agent should automatically reduce its autonomy level when it encounters uncertainty, novel situations, or conditions outside its training distribution. A Level 4 agent that encounters a previously unseen attack pattern should downshift to Level 2, presenting its analysis and asking for human guidance rather than attempting autonomous remediation of something it does not understand.
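Dynamic downshifting can be sketched as a per-decision autonomy selector. The thresholds and the novelty score below are assumptions for illustration; the CSA framework defines the levels, not this logic.

```python
# Per-decision downshift sketch; thresholds and the novelty score are
# illustrative assumptions, not part of the CSA framework itself.
def select_autonomy(default_level: int, confidence: float,
                    novelty: float) -> int:
    """Return a CSA-style level (0-5) for this specific decision."""
    if novelty > 0.8:                   # unlike anything in training data
        return min(default_level, 2)    # Supervised Autonomy: ask the human
    if confidence < 0.6:
        return min(default_level, 2)
    return default_level
```

The `min` is the key design choice: uncertainty can only ever lower the autonomy level, never raise it above the configured default.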

Key insight: Graduated autonomy is not about achieving the highest possible autonomy level. It is about achieving the right autonomy level for each specific decision at each specific moment. The best systems are not the most autonomous – they are the ones that know when to ask for help.

Pattern Selection Framework

Choosing the right pattern requires evaluating four dimensions of the operational context:

Pattern | Risk Tolerance | Time Sensitivity | Human Expertise Required | Reversibility
Recommend and Wait | Low (high-consequence actions) | Low to moderate (minutes to hours available) | High (contextual judgment essential) | Low (irreversible or costly to reverse)
Triage and Escalate | Moderate (prioritization errors are recoverable) | High (volume demands fast processing) | Moderate (expertise needed for escalated items) | Moderate (routing errors delay but don’t prevent resolution)
Execute and Report | Moderate to high (accepts occasional errors) | Very high (delay cost exceeds error cost) | Low (actions are well-understood and procedural) | High (actions must be reversible)
Draft and Refine | Moderate (editing catches most errors) | Moderate (review cycle adds latency) | High (evaluation requires deep expertise) | High (artifacts can be revised before deployment)
Graduated Autonomy | Variable (adapts to context) | Variable (adapts to urgency) | Variable (adjusts to availability) | Variable (matches autonomy to reversibility)

Reference Frameworks

The five patterns described in this chapter draw on and are compatible with several established frameworks that practitioners should be aware of:

Parasuraman, Sheridan, and Wickens (2000)

The four-stage model extends the original Sheridan-Verplank scale by recognizing that automation can be applied independently to four stages of human information processing: information acquisition, information analysis, decision selection, and action implementation. A system might be highly automated in information acquisition (automatically gathering logs and metrics) while remaining fully manual in decision selection (the human decides what to do). This decomposition is essential for designing nuanced interaction patterns that automate the right stages for the right reasons.
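One way to apply the four-stage decomposition in practice is to record an explicit automation level per stage, so the design decision is visible and auditable. The profile below is an illustrative sketch on a rough Sheridan-style 1-10 scale; the values are design choices, not a prescription from the paper.

```python
# Per-stage automation profile, sketched on a rough Sheridan-style
# 1-10 scale; values are illustrative design choices.
incident_response_profile = {
    "information_acquisition": 8,   # logs and metrics gathered automatically
    "information_analysis": 6,      # AI correlates and summarizes
    "decision_selection": 2,        # the human decides what to do
    "action_implementation": 5,     # AI pre-stages; the human triggers
}
```

This matches the example in the text: heavily automated acquisition alongside near-manual decision selection, in the same system.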

NIST AI Risk Management Framework

NIST AI RMF provides a structured approach to identifying and mitigating risks in AI systems, organized around four functions: Govern, Map, Measure, and Manage. It does not prescribe specific interaction patterns but provides the risk assessment methodology that should inform pattern selection.

Microsoft Human-AI Experience (HAX) Guidelines

Microsoft’s 18 HAX Guidelines address the full lifecycle of human-AI interaction, from initial calibration (“Make clear how well the system can do what it can do”) to error handling (“Support efficient correction”) to long-term trust (“Encourage granular feedback”). They are particularly useful for the UX layer of seam design.

Google PAIR (People + AI Research)

Google’s PAIR Guidebook provides design guidance organized around the concept of “AI-first” design – starting from the AI’s capabilities and limitations rather than from a traditional UX workflow. Its emphasis on mental models (helping users understand what the AI can and cannot do) aligns directly with the situation awareness concerns discussed in Chapter 1.

Choosing Your Starting Point

For organizations beginning to deploy AI agents in operations, two practical recommendations:

Start with Recommend and Wait. It is the safest pattern, it builds the data needed to evaluate the AI’s performance, and it establishes the trust foundation required for higher autonomy levels. Organizations that skip directly to Execute and Report without first validating the AI’s recommendations in a Recommend and Wait mode are taking unnecessary risk.

Design for Graduated Autonomy from the beginning. Even if your initial deployment is purely Recommend and Wait, architect the system so that the autonomy level can be adjusted per action type without a redesign. Define the criteria for promotion and demotion. Instrument the system to track recommendation acceptance rates, override patterns, and outcome quality. The data you collect during Recommend and Wait is the foundation for every subsequent autonomy decision.
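The instrumentation described here can start as a simple counter per action type. A minimal sketch with hypothetical names:

```python
# Minimal instrumentation sketch with hypothetical names: per-action
# acceptance and override counts collected during Recommend and Wait.
from collections import defaultdict

class SeamMetrics:
    def __init__(self) -> None:
        self.counts = defaultdict(lambda: {"accepted": 0, "overridden": 0})

    def record(self, action_type: str, accepted: bool) -> None:
        key = "accepted" if accepted else "overridden"
        self.counts[action_type][key] += 1

    def acceptance_rate(self, action_type: str) -> float:
        c = self.counts[action_type]
        total = c["accepted"] + c["overridden"]
        return c["accepted"] / total if total else 0.0
```

A sustained high acceptance rate for one action type is exactly the kind of evidence that later justifies promoting that action toward Execute and Report.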

The structural patterns define what the system does. The next chapter examines what the human does – and more importantly, what the human fails to do – when interacting with these patterns.

Chapter 3

Chapter 3: The Psychology of Handoff

The most dangerous assumption in AI-assisted operations is that humans will behave rationally when interacting with automated systems. They will not. Not because operators are careless or incompetent, but because the human cognitive architecture that served us well for millennia is systematically mismatched to the demands of monitoring and overriding automated systems. Understanding these mismatches is not optional – it is a prerequisite for designing interaction patterns that actually work.

This chapter covers five cognitive phenomena that directly affect the quality of human decisions at the AI-human seam. Each has been extensively documented in peer-reviewed research. Each has produced real-world failures with measurable consequences. And each has design implications that, if ignored, will undermine even the most carefully engineered structural patterns from Chapter 2.

Automation Bias

Automation bias is the tendency of humans to favor suggestions from automated systems over contradictory information from other sources, including their own observations. It is not laziness. It is a well-documented cognitive shortcut: the human brain treats the automated system as an authority and adjusts its processing accordingly.

The Evidence

The landmark study is Skitka, Mosier, and Burdick (1999), which tested pilots and non-pilots in a simulated flight environment where an automated monitoring system occasionally provided incorrect recommendations.

The results were stark:

  • Commission errors (taking an incorrect action recommended by the automation): 100% of participants committed at least one commission error. Every single participant, including experienced pilots, followed the automation’s recommendation at least once when it was demonstrably wrong.
  • Omission errors (failing to notice problems the automation missed): 55% of participants missed events that the automation failed to flag, even when clearly visible on their instruments.

Perhaps most troubling: having a second crew member present – a standard mitigation for human error in aviation – did not reduce automation bias errors. Parasuraman and Manzey’s meta-analysis (2010) confirmed the pattern across multiple domains and added a critical finding: operators of high-reliability systems were 50% less likely to detect automation failures than operators of less reliable systems. The more trustworthy the automation’s track record, the less the human monitors it.

Real-World Consequences

The Enbridge pipeline rupture (2010) demonstrated automation bias at operational scale. SCADA alarms indicated a pressure drop consistent with a rupture. Control room operators, calibrated by years of false alarms, dismissed the warnings for 17 hours, twice restarting the pipeline and pumping additional oil into the environment. Cleanup exceeded $1 billion.

The UK Post Office Horizon scandal demonstrated it at institutional scale. The Horizon IT system contained bugs that created phantom financial shortfalls. Despite hundreds of sub-postmasters reporting the system’s figures didn’t match reality, the Post Office systematically trusted the computer over humans, resulting in 736 wrongful prosecutions over 16 years.

Key insight: Automation bias is not a character flaw. It is a predictable response to a poorly designed interaction. When a system is right 99% of the time, the rational Bayesian response is to trust it – and that same rational response will cause the operator to miss the 1% of cases where trust is misplaced. The design must account for this, not the operator.

Design Implications

Cognitive forcing functions – interface elements that require the operator to actively engage before accepting the AI’s recommendation – are the primary countermeasure. A Harvard CHI 2021 study demonstrated that requiring operators to state their own assessment before seeing the AI’s recommendation significantly reduced automation bias errors. The tradeoff: users found these systems less satisfying to use, creating a direct conflict between safety and usability that designers must navigate explicitly.
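A minimal sketch of such a forcing function, in the spirit of the assessment-before-recommendation design that study tested (class and method names are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForcedAssessmentGate:
    """Cognitive forcing function: the AI's recommendation stays hidden
    until the operator records an independent assessment (hypothetical sketch)."""
    ai_recommendation: str
    operator_assessment: Optional[str] = None

    def submit_assessment(self, assessment: str) -> None:
        if not assessment.strip():
            raise ValueError("operator must state a non-empty assessment first")
        self.operator_assessment = assessment

    def reveal_recommendation(self) -> str:
        if self.operator_assessment is None:
            raise PermissionError("recommendation locked until assessment is recorded")
        return self.ai_recommendation

    def disagreement(self) -> bool:
        # Disagreements between operator and AI are a useful audit signal.
        return self.operator_assessment != self.ai_recommendation
```

Logging the disagreement flag gives the organization the override-pattern data that later autonomy decisions depend on.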

Alert Fatigue

Alert fatigue is the progressive desensitization of operators to alerts as a result of excessive volume, high false positive rates, or both. It is the complement of automation bias: instead of trusting the wrong recommendation, the operator ignores all recommendations because the signal-to-noise ratio has collapsed.

The Scale of the Problem

The numbers are consistent across industries:

  • Healthcare: 72-99% of clinical alarms are false (AHRQ, 2020). Clinicians override approximately 90% of medication alerts. ECRI Institute has documented at least 80 fatalities directly attributable to alarm fatigue.
  • Security Operations: The average SOC receives 2,992 security alerts per day, of which 63% go entirely unaddressed. Sophisticated attackers exploit this through “alert storming” – generating high volumes of low-priority alerts to mask genuine intrusions.
  • IT Operations: Similar patterns in infrastructure monitoring, where noisy alerting configurations generate hundreds or thousands of alerts per day, the majority transient or duplicative.

Evidence-Based Remediation

Alert fatigue is not intractable. Boston Medical Center redesigned its clinical alarm system with threshold adjustments, suppression of non-actionable conditions, and tiered notification routing. Alarm volume dropped from 87,829 per week to 9,967 – an 89% reduction – without any increase in adverse patient outcomes.

The lesson: the value of an alerting system is not proportional to its sensitivity. A system that generates 3,000 alerts per day and catches 95% of real incidents is less useful than one that generates 300 alerts and catches 90%, because the first system trains operators to ignore alerts.

Design Implications

For AI agents operating in the Triage and Escalate pattern, alert fatigue is the primary failure mode. Countermeasures:

  • Aggressive deduplication and correlation: Group related alerts into incidents.
  • Confidence-based filtering: Suppress alerts below a confidence threshold, accepting occasional misses to preserve operator attention.
  • Adaptive thresholds: Adjust based on context (time of day, recent changes, current incident load).
  • Alert budgets: Cap total daily escalations, forcing the system to prioritize.
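A toy sketch combining three of these countermeasures, deduplication, confidence-based filtering, and an alert budget (the thresholds and the alert dictionary shape are assumptions for illustration):

```python
from collections import defaultdict

class EscalationGate:
    """Sketch of an escalation gate: deduplicates into incidents, suppresses
    low-confidence alerts, and enforces a daily budget (values illustrative)."""

    def __init__(self, min_confidence: float = 0.7, daily_budget: int = 300):
        self.min_confidence = min_confidence
        self.daily_budget = daily_budget
        self.escalated_today = 0
        self.incidents = defaultdict(list)  # dedup key -> correlated alerts

    def submit(self, alert: dict) -> str:
        key = (alert["source"], alert["signature"])
        if key in self.incidents:
            # Deduplication/correlation: fold into the existing incident.
            self.incidents[key].append(alert)
            return "correlated"
        if alert["confidence"] < self.min_confidence:
            # Confidence filter: accept occasional misses to preserve attention.
            return "suppressed"
        if self.escalated_today >= self.daily_budget:
            # Budget exhausted: defer to batch review instead of paging.
            return "deferred"
        self.incidents[key].append(alert)
        self.escalated_today += 1
        return "escalated"
```

The budget forces prioritization: once it is spent, new incidents queue for review rather than competing for an operator's already-saturated attention.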

The Anchoring Effect

Anchoring is the cognitive bias identified by Tversky and Kahneman (1974) in which an initial piece of information disproportionately influences subsequent judgments, even when the anchor is arbitrary or irrelevant. In AI-human interaction, the AI’s initial recommendation serves as a powerful anchor.

A 2025 study of 775 managers confirmed that anchoring effects persist even among experienced professionals in their domain of expertise, and even when participants were explicitly warned about anchoring bias before making their judgments. Experience and awareness reduce anchoring but do not eliminate it.

The design implication is direct: when an AI agent presents a recommendation first, the operator’s subsequent investigation is shaped by that framing. They are more likely to seek confirming evidence and less likely to pursue alternative hypotheses.

Design Implications

  • Consider-the-opposite: Explicitly prompt operators to consider alternative explanations before accepting the AI’s recommendation.
  • Data before recommendation: Present the raw data and context before revealing the AI’s recommendation, giving the operator an opportunity to form an independent assessment. More expensive in operator time but significantly reduces anchoring.

Complacency Drift

Complacency drift is the gradual erosion of vigilance that occurs when an automated system performs reliably over an extended period. Unlike automation bias (which operates at individual decisions), complacency drift operates at the level of sustained monitoring behavior, creating a widening gap between the oversight provided and the oversight assumed.

M/V Royal Majesty (1995)

The cruise ship M/V Royal Majesty ran aground near Nantucket with 1,509 people aboard because the ship’s GPS antenna cable had detached, causing the GPS to switch to dead reckoning. The system displayed a warning indicator. The bridge team did not notice – for 34 hours, the ship sailed on a progressively divergent course, drifting 17 nautical miles off track. Multiple independent indicators – radar, depth soundings, visual observations – contradicted the GPS position, but the crew had stopped cross-checking.

Closely related is skill degradation: the FAA has documented that 60% of aviation accidents involving pilot error included a lack of manual flying proficiency – skills that atrophied because autopilot handled the flying. In IT operations, this manifests when AI agents handle investigation and resolution for extended periods, and operators lose the diagnostic skills that escalation assumes they have.

The CIGI Agency Decay Model

The Centre for International Governance Innovation describes a four-stage organizational pattern: Experimentation (AI supplements human work) → Integration (AI becomes standard, independent analysis declines) → Reliance (AI is primary input, skills atrophy, new staff trained to work with AI, not without it) → Dependency (organization cannot function without AI, no fallback).

Key distinction: Complacency drift is not about individual operators making bad decisions. It is about organizational systems gradually losing their capacity for independent judgment. Countering it requires organizational interventions: periodic mandatory manual operation, challenge tickets with known outcomes, tracking approval-without-review rates, and simulation-based skill maintenance.
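Tracking approval-without-review rates can be as simple as flagging approvals that arrive faster than any genuine review could. A sketch (the 10-second threshold and the decision-record shape are assumptions, not from this guide):

```python
def approval_without_review_rate(decisions: list,
                                 min_review_seconds: float = 10.0) -> float:
    """Fraction of approvals made faster than a plausible review could take,
    used as one proxy signal for complacency drift (threshold illustrative)."""
    approvals = [d for d in decisions if d["action"] == "approve"]
    if not approvals:
        return 0.0
    rubber_stamps = [d for d in approvals
                     if d["seconds_to_decision"] < min_review_seconds]
    return len(rubber_stamps) / len(approvals)
```

A rising trend in this metric is an organizational early-warning signal, best paired with the challenge tickets and mandatory manual-operation drills described above.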

Bringing It Together

These five phenomena – automation bias, alert fatigue, anchoring, complacency drift, and skill degradation – are not independent. They interact and reinforce each other:

  • Alert fatigue increases automation bias (overwhelmed operators accept AI recommendations without scrutiny).
  • Complacency drift accelerates skill degradation (operators who stop monitoring closely also stop practicing the skills needed for effective monitoring).
  • Anchoring reinforces automation bias (the AI’s recommendation shapes thinking, making independent evaluation harder).
  • Diffusion of responsibility between human and AI enables complacency drift – as Bleher and Braun (2022) observed, “the human says ‘I followed the system’ and the vendor says ‘the human made the final call.’” When no one feels individually accountable, there is less motivation to maintain vigilance.

The structural patterns from Chapter 2 provide the skeleton of effective human-AI interaction. The cognitive phenomena in this chapter determine whether that skeleton supports a functional system or an empty one. A Recommend and Wait pattern that presents its recommendations in a way that anchors the operator and provides no forcing function for independent evaluation is, in practice, an Execute and Report pattern with extra steps.

The next chapter examines how to present information at the seam – the specific communication formats and disclosure strategies that support good human decision-making in the face of these cognitive challenges.

Chapter 4

Chapter 4: Context Presentation

How you present information determines what the operator sees. And what the operator sees determines what they decide. This is not a metaphor. It is a measurable, reproducible phenomenon: the same incident data, presented in different formats, reliably produces different decisions from the same operators.

The cognitive challenges described in Chapter 3 – automation bias, anchoring, alert fatigue – are not fixed properties of human cognition. They are properties of the interaction between human cognition and information design. A well-designed presentation format can reduce anchoring. A poorly designed one can amplify it. The format is not a cosmetic layer applied after the engineering is done. It is a load-bearing element of the system architecture.

This chapter presents several evidence-based frameworks for context presentation at the AI-human seam, with specific guidance on how to apply each one in operational AI systems.

The SBAR Framework

SBAR – Situation, Background, Assessment, Recommendation – is a structured communication framework developed by the United States Navy for use on nuclear submarines, where communication errors between crew members could have catastrophic consequences. The framework was subsequently adapted for healthcare by Kaiser Permanente in the early 2000s, where it became the basis for clinical handoff communication across thousands of hospitals.

The Evidence

The adoption of SBAR in healthcare, facilitated through the TeamSTEPPS (Team Strategies and Tools to Enhance Performance and Patient Safety) program developed by the Department of Defense and the Agency for Healthcare Research and Quality, produced one of the most dramatic improvements in communication quality ever documented in a controlled study. Before TeamSTEPPS implementation, observers rated the adequacy of nurse-to-physician communication in the study population at 4.8%. After implementation – with SBAR as the core communication structure – adequacy ratings rose to 100%.

The magnitude of this improvement demands explanation. The information available to the nurses did not change. Their clinical knowledge did not change. What changed was the structure in which they communicated. SBAR gave them a framework that ensured they included all critical information, presented it in a predictable order, and made an explicit distinction between observation (Situation, Background) and interpretation (Assessment, Recommendation).

SBAR Adapted for AI Agent Output

The same principles apply directly to how an AI agent communicates with a human operator. An unstructured output – a wall of text summarizing an investigation – forces the operator to extract structure, which is exactly the kind of cognitive work that leads to missed information and anchoring on the first pattern recognized. A structured output reduces cognitive load and ensures completeness.

For operational AI systems, SBAR can be adapted into a six-element framework:

  • WHAT HAPPENED (Situation) – Concise statement of the event or condition detected. Purpose: orient the operator to the current state.
  • WHAT I TRIED (Background) – Actions the AI agent took during investigation or initial remediation. Purpose: provide context on what is already known and ruled out.
  • WHAT I RECOMMEND (Recommendation) – Specific recommended action with expected outcome. Purpose: give the operator a clear decision point.
  • RISK LEVEL (Assessment) – Severity classification with brief justification. Purpose: calibrate the urgency of the operator’s response.
  • COST OF INACTION (extension) – What happens if no action is taken, with estimated timeline. Purpose: counter the status quo bias and create urgency where warranted.
  • EVIDENCE (extension) – Links to logs, metrics, traces, and knowledge base articles. Purpose: enable independent verification and deep investigation.

The extensions beyond standard SBAR – Cost of Inaction and Evidence – address specific challenges of AI-human interaction. Cost of Inaction counters the natural human tendency toward inaction when faced with uncertainty (the status quo bias). Evidence addresses automation bias by providing the raw material for independent verification, rather than asking the operator to trust the AI’s synthesis alone.
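The six-element structure lends itself to a simple typed record that the agent populates and renders deterministically, so the output order and completeness never depend on the model's mood. A sketch (field and method names are illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SBARBrief:
    """Six-element SBAR-derived structure for agent output. Field names
    mirror the adapted framework; the rendering format is illustrative."""
    what_happened: str      # Situation
    what_i_tried: str       # Background
    what_i_recommend: str   # Recommendation
    risk_level: str         # Assessment
    cost_of_inaction: str   # Extension: counter status quo bias
    evidence: List[str]     # Extension: links for independent verification

    def render(self) -> str:
        # Fixed order guarantees predictability; missing fields fail loudly
        # at construction time rather than silently in the output.
        return "\n".join([
            f"WHAT HAPPENED: {self.what_happened}",
            f"WHAT I TRIED: {self.what_i_tried}",
            f"WHAT I RECOMMEND: {self.what_i_recommend}",
            f"RISK LEVEL: {self.risk_level}",
            f"COST OF INACTION: {self.cost_of_inaction}",
            "EVIDENCE: " + " ".join(f"[{e}]" for e in self.evidence),
        ])
```

Having the LLM fill a schema and a template do the rendering is one way to get SBAR's predictable ordering without trusting free-form generation to preserve it.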

Practical Example

Consider an AI agent that has detected a potential security incident. Here is the same information presented in three formats:

Format 1: Raw Dump

Alert ID: SEC-2026-04891. Source: SIEM correlation rule CR-142. 
Timestamp: 2026-04-11T03:42:17Z. Multiple failed SSH login attempts 
detected from IP 198.51.100.42 targeting hosts prod-web-01 through 
prod-web-08. Total attempts: 847 over 23 minutes. Source IP 
geolocation: São Paulo, Brazil. No successful authentications. 
Rate: ~37 attempts/minute. Pattern consistent with credential stuffing. 
Threat intel: IP not in known blocklists. Previous activity from 
this IP: none in last 90 days. Host vulnerability scan (last 7 days): 
prod-web-03 has unpatched OpenSSH (CVE-2024-6387). Current SSH 
configuration: password authentication enabled on all hosts. 
Fail2ban status: active, threshold set at 10 failures/5 minutes, 
current ban list does not include source IP due to distributed 
timing pattern...

Format 2: SBAR Structure

WHAT HAPPENED: Credential stuffing attack in progress against 8 
production web servers from a single IP (198.51.100.42, São Paulo).

WHAT I TRIED: Queried threat intel (IP not in known blocklists), 
checked for successful auths (none found), reviewed SSH configs 
(password auth enabled on all 8 hosts), checked fail2ban (active 
but not triggered due to distributed timing).

WHAT I RECOMMEND: Block IP 198.51.100.42 at the network firewall 
and disable password-based SSH authentication on all production hosts.

RISK LEVEL: Medium. No breach detected yet, but prod-web-03 has 
an unpatched OpenSSH vulnerability (CVE-2024-6387) that could be 
exploited if the attacker shifts tactics.

COST OF INACTION: Continued brute-force attempts with potential 
exploitation of CVE-2024-6387 on prod-web-03. If the vulnerability 
is exploited, the attacker gains shell access to a production server.

EVIDENCE: [SIEM Alert] [SSH Logs] [Vuln Scan Report] [Threat Intel Query]

Format 3: Progressive Disclosure (detailed in the next section)

Layer 1 (5-second glance):
  🟡 MEDIUM | Credential stuffing on 8 prod web servers | 
  Recommend: Block source IP + disable password auth

Layer 2 (30-second assessment):
  [Full SBAR as above]

Layer 3 (deep dive):
  [Complete evidence chain with log excerpts, CVE details, 
  network topology, historical context]

The raw dump contains all the same information as the SBAR format, but it forces the operator to perform the cognitive work of structuring it. Under the time pressure and alert volume typical of security operations, this cognitive work is exactly what gets skipped – and its omission is what leads to missed context and poor decisions.

The Klein Recognition-Primed Decision Model

Gary Klein’s Recognition-Primed Decision (RPD) model, developed through field studies of firefighters, military commanders, and intensive care nurses, fundamentally challenges the classical model of decision-making as a process of comparing alternatives.

The Evidence

Klein’s research found that 78% of expert decisions were made in under one minute, and the process was not comparison-based but recognition-based. Experts did not generate a list of options, evaluate each against criteria, and select the best. Instead, they recognized the current situation as similar to a previously encountered pattern, retrieved the action that worked in that pattern, mentally simulated whether it would work in the current situation, and either executed it or modified it.

This has a direct and counterintuitive design implication: presenting multiple options to an expert operator may degrade decision quality rather than improve it. The expert’s cognitive process is optimized for evaluating a single option against the situation, not for comparing options against each other. A system that presents three possible root causes with pros and cons for each is fighting the expert’s natural decision process. A system that presents the single most likely root cause with supporting evidence and a recommended action is working with it.

Design Implication for AI Agents

Present the AI’s single recommended action first, with supporting evidence. Make alternative explanations available on demand (progressive disclosure, discussed below), but do not force the expert to process them before evaluating the primary recommendation.

This does not mean hiding alternatives. It means structuring the presentation so that the operator’s first cognitive engagement is with the most likely hypothesis, which is the engagement pattern that matches how experts actually think. If the primary recommendation does not match the operator’s pattern recognition – if something feels wrong – the operator will seek alternatives. The system should make that easy. But it should not force it as the default path.

Key distinction: For novice operators, presenting alternatives may be valuable because novices lack the pattern library that enables recognition-primed decisions. The optimal presentation format depends on the operator’s expertise level – another argument for adaptive interfaces that adjust to the user.

Time Pressure and Decision Quality

The interaction between time pressure and AI assistance is more nuanced than “faster is better” or “slower is safer.” Research by Swaroop et al. at Harvard (2023) found that different types of AI assistance have different accuracy-time tradeoffs, and the optimal type of assistance depends on the time available for the decision.

Under low time pressure, operators benefited most from AI assistance that provided explanations and supporting evidence – the kind of assistance that enables analytical reasoning and independent verification. Under high time pressure, operators benefited most from simple, direct recommendations – the kind of assistance that supports rapid pattern matching.

More concerning, the research found that under time pressure, decisions became riskier and overreliance on AI increased. Operators under time pressure were more likely to accept the AI’s recommendation without evaluation, more likely to choose the riskier option when the AI suggested it, and less likely to notice errors in the AI’s reasoning.

Design Implication

The presentation format should adapt to the urgency of the situation:

  • Low urgency (minutes to hours): Present full SBAR with evidence links, encourage independent verification, apply cognitive forcing functions (see Chapter 3).
  • Moderate urgency (seconds to minutes): Present SBAR summary with single recommended action, make evidence available but do not require review.
  • High urgency (immediate): Present action and severity only, with one-click execution. Log the decision for post-hoc review.
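This urgency-to-format mapping can be expressed as a small lookup. A sketch following the guidance above (the plan fields are illustrative, not a standard schema):

```python
from enum import Enum

class Urgency(Enum):
    LOW = "minutes_to_hours"
    MODERATE = "seconds_to_minutes"
    HIGH = "immediate"

def presentation_plan(urgency: Urgency) -> dict:
    """Map decision urgency to a presentation plan (hypothetical sketch)."""
    if urgency is Urgency.HIGH:
        # Action and severity only, one-click execution, post-hoc review.
        return {"layers": [1], "evidence": "none", "forcing_function": False,
                "one_click_execute": True, "log_for_review": True}
    if urgency is Urgency.MODERATE:
        # SBAR summary with single recommendation; evidence available, not required.
        return {"layers": [1, 2], "evidence": "on_demand", "forcing_function": False,
                "one_click_execute": False, "log_for_review": True}
    # Low urgency: full SBAR, inline evidence, cognitive forcing functions.
    return {"layers": [1, 2, 3], "evidence": "inline", "forcing_function": True,
            "one_click_execute": False, "log_for_review": True}
```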

This maps directly to the Progressive Disclosure framework discussed next.

Progressive Disclosure

Progressive disclosure is an information architecture principle that organizes content into layers of increasing detail, allowing the user to access the level of detail they need without being overwhelmed by the level they do not. In operational AI systems, it is the primary mechanism for supporting both the rapid pattern-matching of experts and the thorough analysis of novices within a single interface.

The Three Layers

Layer 1: The 5-Second Glance

This is what the operator sees when they first look at the screen, scan a notification, or glance at a dashboard. It must communicate three things in five seconds or less:

  • Severity (visual indicator: color, icon, or categorical label)
  • Summary (one sentence: what happened and what is at stake)
  • Recommended action (one phrase: what to do)

Layer 1 supports the expert’s recognition-primed decision process. An experienced operator scanning Layer 1 either recognizes the pattern and acts, or does not recognize it and drills down. There is no wasted cognitive effort on detail that is not needed for the initial recognition.

Example:

🔴 CRITICAL | Database primary failover detected, replication lag 
increasing | Recommend: Promote replica db-replica-02 to primary

Layer 2: The 30-Second Assessment

This is the SBAR brief with confidence levels. It provides enough context for the operator to evaluate the AI’s recommendation, ask clarifying questions, or form an alternative hypothesis. It is the layer where the operator transitions from pattern recognition to analytical reasoning.

Layer 2 includes:

  • Full SBAR structure (What Happened, What I Tried, What I Recommend, Risk Level, Cost of Inaction)
  • AI confidence level (discussed in the next section)
  • Key metrics and their trends
  • Relevant recent changes or events

Layer 3: The Deep Dive

This is the full evidence chain – raw logs, metrics timeseries, configuration diffs, knowledge base articles, historical incident records, and the AI’s reasoning chain. It is used for post-incident review, for cases where the operator disagrees with the AI’s assessment, or for novel situations that do not match any known pattern.

Layer 3 is also where evidence linking (discussed below) provides its value, allowing the operator to trace the AI’s conclusions back to specific data points.

Why Three Layers

Three is not arbitrary. Cognitive load research consistently shows that humans can effectively process 3-5 chunks of information at a time (Miller, 1956; Cowan, 2001). Three layers map to three distinct cognitive modes:

  • Layer 1 – 5 seconds. Cognitive mode: pattern recognition. Decision type: act or investigate further. User state: scanning, triaging.
  • Layer 2 – 30 seconds. Cognitive mode: analytical reasoning. Decision type: approve, modify, or reject recommendation. User state: focused evaluation.
  • Layer 3 – minutes to hours. Cognitive mode: deep analysis. Decision type: root cause investigation, post-incident review. User state: deliberate investigation.

Confidence Communication

How an AI agent communicates its confidence in a recommendation is one of the most consequential and most frequently mishandled aspects of context presentation.

The Problem with Raw Probabilities

The intuitive approach – presenting a numerical probability (“87% confidence this is a credential stuffing attack”) – is worse than useless for most operators. Research consistently shows that:

  • Humans miscalibrate probabilities, overweighting low probabilities and underweighting high ones (Kahneman & Tversky, 1979).
  • Numerical probabilities create false precision. “87% confidence” implies a level of calibration that no current LLM possesses.
  • Different operators interpret the same probability differently. “87%” might feel near-certain to one operator and uncomfortably uncertain to another.

Categorical Confidence with Calibration

A more effective approach uses categorical labels mapped to defined probability ranges and operational implications:

  • Confirmed (>95%) – Evidence is conclusive; proceed with recommended action.
  • High confidence (80-95%) – Strong evidence; recommendation is likely correct but verify key assumptions.
  • Moderate confidence (60-80%) – Supporting evidence exists but alternative explanations are plausible; investigate before acting.
  • Low confidence (40-60%) – Evidence is ambiguous; treat as a lead for investigation, not a basis for action.
  • Speculative (<40%) – Insufficient evidence; further investigation required before any action.

The value of categorical labels is not precision – it is calibration of operator behavior. “High confidence” communicates not just a probability but an expected response: verify key assumptions, then act. “Low confidence” communicates a different expected response: investigate further. The label guides behavior in a way that a number does not.
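The mapping from a model-estimated probability to a categorical label and its expected operator response is a straightforward banding function. A sketch using the ranges above (boundary handling at the band edges is a design choice, not mandated here):

```python
def confidence_category(p: float) -> tuple:
    """Map an estimated probability to (label, expected operator behavior).
    Bands follow the categorical scheme described above; boundaries are
    treated as exclusive upper edges, an illustrative choice."""
    bands = [
        (0.95, "Confirmed", "Evidence is conclusive; proceed with recommended action"),
        (0.80, "High confidence", "Verify key assumptions, then act"),
        (0.60, "Moderate confidence", "Investigate before acting"),
        (0.40, "Low confidence", "Treat as a lead for investigation, not a basis for action"),
    ]
    for threshold, label, behavior in bands:
        if p > threshold:
            return label, behavior
    return ("Speculative",
            "Insufficient evidence; further investigation required before any action")
```

Note that the function returns the behavioral instruction alongside the label; presenting the two together is what makes the category calibrate the operator rather than merely describe the model.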

Uncertainty Visualization

Research by Reyes et al. (2025) found that presenting uncertainty visualizations – graphical representations of the AI’s confidence distribution rather than a single point estimate – enhanced appropriate trust for 58% of participants. Operators who saw uncertainty visualizations were better at calibrating their trust: trusting high-confidence outputs more and low-confidence outputs less, compared to operators who received only point estimates.

A complementary study at ACM FAccT (2025) found that distance-based confidence scores – metrics that communicate how similar the current situation is to the training data the AI was calibrated on – yielded 8.2% higher correct decisions compared to traditional confidence scores. Distance-based scores help operators understand not just how confident the AI is, but how relevant its confidence calibration is to the current situation.

Key insight: The goal of confidence communication is not to convey the AI’s internal state accurately. It is to calibrate the operator’s behavior appropriately. A confidence format that causes operators to verify high-confidence recommendations and investigate low-confidence ones is succeeding, regardless of how precisely it maps to the model’s actual probability distribution.

Evidence Linking and Explainability

The final component of context presentation is evidence linking: connecting the AI’s conclusions and recommendations to the specific data points that support them. This serves two functions: it enables independent verification (countering automation bias), and it provides the raw material for the operator to construct their own situation awareness rather than relying entirely on the AI’s synthesis.

RAG Citations and Inline References

For AI agents using retrieval-augmented generation (RAG), the most straightforward form of evidence linking is inline citations: marking each claim in the AI’s output with a reference to the source document, log entry, or metric that supports it. This is the same approach used in academic writing, adapted for operational context.

Example:

Root cause assessment: The connection pool exhaustion on db-primary-01 
[1] was triggered by the deployment of v2.4.7 at 14:32 UTC [2], which 
introduced a connection leak in the user authentication module [3]. 
Connection count increased from baseline 45 to maximum 500 over 
23 minutes [4], causing cascading timeouts in downstream services [5].

Sources:
[1] CloudWatch metric: db-primary-01 active connections (14:00-15:00 UTC)
[2] Deployment log: v2.4.7 release record
[3] Git diff: commit a3f7c2e, file auth/connection_pool.py, lines 142-158
[4] Connection pool metrics dashboard (link)
[5] Service dependency map with error propagation trace (link)

Progressive Disclosure of Reasoning Chain

For more complex analyses, the AI’s reasoning chain itself can be presented using progressive disclosure:

  • Layer 1: Conclusion and recommended action (no reasoning).
  • Layer 2: Key reasoning steps – the 3-4 most important logical connections between evidence and conclusion.
  • Layer 3: Full reasoning chain, including hypotheses that were considered and rejected, with evidence for and against each.

This approach respects the expert’s recognition-primed decision process (Layer 1 is sufficient if the pattern is familiar) while providing the full audit trail for cases that require deeper analysis or post-incident review.

The DARPA XAI Program

The Defense Advanced Research Projects Agency (DARPA) invested $75 million in its Explainable AI (XAI) program, which ran from 2017 to 2021 and tested multiple approaches to making AI systems’ reasoning transparent to human operators. The key finding for operational context: example-based explanations were the most effective at improving human decision-making.

Rather than explaining the AI’s internal logic (“the neural network assigned weight 0.73 to feature X”), example-based explanations present similar cases from the past and their outcomes: “This situation is similar to Incident INC-2025-3847, which was caused by a DNS misconfiguration and resolved by flushing the DNS cache. The resolution took 12 minutes and no customer impact was reported.”

Example-based explanations work because they align with the recognition-primed decision model: they help the operator match the current situation to a known pattern, which is the cognitive process experts actually use.

Putting It All Together

Effective context presentation at the AI-human seam integrates all of these frameworks:

  1. Structure the output using SBAR to ensure completeness and predictability.
  2. Prioritize the recommended action first, consistent with the RPD model, and make alternatives available on demand.
  3. Layer the information using progressive disclosure so that each operator can engage at the depth appropriate to their expertise and the situation’s urgency.
  4. Calibrate confidence communication using categorical labels with operational implications, not raw probabilities.
  5. Link conclusions to evidence using inline citations and example-based explanations.

These are not independent design choices. They interact: SBAR provides the structure for Layer 2. The RPD model determines what goes in Layer 1. Confidence communication determines how the operator engages with Layers 1 and 2. Evidence linking populates Layer 3.

The result, when implemented cohesively, is a presentation format that:

  • Supports fast pattern-matching for experienced operators (Layer 1, RPD alignment)
  • Enables analytical evaluation when needed (Layer 2, SBAR structure)
  • Provides full audit trail for post-hoc review and learning (Layer 3, evidence linking)
  • Calibrates operator trust appropriately (confidence communication)
  • Reduces automation bias by making independent verification easy (evidence linking)
  • Reduces anchoring by presenting data before interpretation when time permits (SBAR ordering)

The next chapter examines how trust between human operators and AI agents develops, calibrates, and – when mismanaged – collapses.

Chapter 5: Trust Calibration

Trust is not a binary. It is a calibration problem.

When a GenAI engineer deploys an AI agent into an operational environment — an IT service desk, a network operations center, a clinical workflow — the central design challenge is not accuracy. It is not latency, cost per token, or even safety in the abstract. The central challenge is ensuring that the humans who work alongside the agent trust it exactly as much as it deserves to be trusted. Not more. Not less. This chapter examines what trust in automated systems actually consists of, how it forms and breaks, and how to design interaction patterns that keep it properly calibrated. Note that this chapter addresses operator-side calibration — how humans interpret and act on AI confidence signals. Model-side calibration (whether the model’s stated confidence matches actual accuracy) is a separate engineering problem; Chapter 6 provides a practical workflow for empirically calibrating model confidence against operational outcomes.

The Lee & See Framework: Performance, Process, Purpose

The foundational model for understanding trust in automation comes from Lee and See (2004), who synthesized decades of research into a three-dimensional framework. Trust, they argued, is not a single attitude but a composite of three distinct judgments:

  • Performance: Can it do the job? This dimension captures the operator’s assessment of the system’s competence: its accuracy, reliability, and consistency across the tasks it is expected to handle.
  • Process: How does it work? This dimension reflects the operator’s understanding of the system’s internal logic. An operator who can form a reasonable mental model of why the system produces a given output will calibrate trust more effectively than one who treats it as a black box.
  • Purpose: Why was it built this way? This dimension addresses the operator’s belief about the designer’s intent. Does the system serve the operator’s goals, or does it optimize for something else?

Each dimension can be miscalibrated independently. An operator might trust the system’s performance based on a run of good outcomes, while having no understanding of its process — a combination that produces brittle trust, vulnerable to collapse at the first unexpected failure. Conversely, an operator who understands the process well but has never seen the system handle an edge case may calibrate performance trust too high.

Key distinction: Overtrust leads to automation bias and complacency — the operator stops checking the system’s work, accepts incorrect recommendations, and loses situational awareness. Undertrust leads to disuse and inefficiency — the operator ignores valid recommendations, duplicates effort, and negates the value of the system entirely. Both failure modes are well-documented in safety-critical domains, and both are present in every AI-augmented operation.

The practical implication for GenAI engineers is that trust calibration requires deliberate design across all three dimensions. Displaying accuracy metrics addresses Performance. Showing reasoning traces addresses Process. Documenting design decisions and optimization targets addresses Purpose. Neglecting any dimension creates a calibration gap.

Dispositional, Situational, and Learned Trust

Hoff and Bashir (2015) extended the trust literature into a layered model that explains why different operators respond so differently to the same system. Their framework identifies three layers of trust that operate simultaneously:

Dispositional trust is the baseline. It reflects an individual’s general tendency to trust or distrust automated systems, shaped by personality, culture, age, and prior experience with technology broadly. A 25-year-old engineer who grew up with recommendation algorithms arrives with a different dispositional baseline than a 55-year-old operations manager whose career predates the internet. Neither baseline is inherently better — both can produce miscalibration.

Situational trust is context-dependent. It fluctuates based on the current operating environment: workload, time pressure, perceived risk, and the availability of alternatives. An operator under extreme time pressure in a P1 incident is more likely to accept an AI recommendation without scrutiny — not because they trust the system more in any stable sense, but because the cost of verification feels higher than the risk of error. This is precisely when automation bias is most dangerous.

Learned trust is the layer that accumulates through direct experience with the specific system. It is the most powerful and the most designable. Merritt and Ilgen (2008) demonstrated that trust shifts rapidly from dispositional to learned within the first few interactions — sometimes in as few as three to five encounters. This finding has profound design implications: the onboarding experience is not merely an introduction. It is the period during which the operator’s long-term trust calibration is being established.

For GenAI engineers, this layered model suggests a phased approach to trust design:

  1. During onboarding, account for dispositional variation. Do not assume a uniform starting point. Some operators will over-rely immediately; others will resist engagement entirely.
  2. During high-pressure operations, design for situational trust inflation. Add friction — confirmation steps, mandatory review of reasoning — precisely when operators are most tempted to skip it.
  3. Across the operational lifecycle, invest heavily in the learned trust layer. Provide transparent performance data. Surface failures honestly. Make the system’s track record visible and navigable.

First-Person Uncertainty Expression

One of the most actionable findings in recent trust calibration research comes from Kim et al. (FAccT 2024, Microsoft Research, N=404). The study examined how AI systems should communicate uncertainty and found that the linguistic framing of uncertainty matters as much as whether uncertainty is communicated at all.

When an AI system expressed uncertainty in the first person — “I’m not sure, but I think this ticket should be categorized as a network issue” — participants reported decreased confidence in the system’s recommendation. At first glance, this seems like a failure. But the critical finding was that this decreased confidence was accompanied by increased decision accuracy. Participants who received first-person hedging were more likely to independently evaluate the recommendation, catch errors, and arrive at correct conclusions.

By contrast, general-perspective hedging — “This might be a network issue” or “There is some uncertainty about the categorization” — produced a weaker effect. The first-person framing appears to activate a different cognitive process: instead of treating uncertainty as a property of the problem (which the operator may not feel equipped to resolve), the first-person framing treats uncertainty as a property of the system’s judgment, which the operator recognizes as something they can and should evaluate.

Key insight: Designing an AI agent to say “I’m not sure” is not a concession of weakness. It is a calibration mechanism. The goal is not to maximize the operator’s confidence in every recommendation — it is to maximize the operator’s accuracy in the decisions they make based on those recommendations.

The implementation pattern is straightforward but requires discipline:

  • When model confidence is below a defined threshold (calibrated to the specific use case), prepend first-person uncertainty markers to the recommendation.
  • Use specific language: “I’m not confident about this assessment” rather than vague hedging like “This could potentially be…”
  • Pair the uncertainty expression with the system’s reasoning, so the operator knows what the system is uncertain about and can focus their verification accordingly.
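A minimal wrapper implementing these three rules might look like the following. The 75-point threshold and the exact phrasing are assumptions that must be calibrated to the specific use case:

```python
UNCERTAINTY_THRESHOLD = 75  # assumed starting point; calibrate empirically

def apply_uncertainty_framing(recommendation: str, confidence: int,
                              uncertain_about: str) -> str:
    """Prepend a first-person uncertainty marker when confidence is below
    the threshold, naming what specifically the system is uncertain about."""
    if confidence >= UNCERTAINTY_THRESHOLD:
        return recommendation
    return (f"I'm not confident about this assessment -- specifically, "
            f"I'm uncertain about {uncertain_about}. "
            f"My best recommendation: {recommendation}")

msg = apply_uncertainty_framing(
    "Temporarily disable the new rate limiter and observe 502 rates.",
    confidence=38,
    uncertain_about="the 90-minute gap between deployment and first errors",
)
```

Keeping the framing in post-processing (rather than relying solely on the prompt) guarantees the marker appears whenever the confidence signal crosses the threshold.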

Track Record Dashboards

A 2024 study of National Weather Service (NWS) forecasters who were integrating AI prediction tools into their workflow found a striking consensus: all forecasters deemed it essential to examine AI predictions for past cases before trusting the system’s current output. They did not want to evaluate the AI on a single forecast. They wanted to see its track record — particularly its failures.

This finding aligns with the learned trust layer in Hoff and Bashir’s framework and points to a concrete design requirement: track record dashboards. These are not simple accuracy percentages. They are navigable histories that allow operators to build calibrated mental models of where the system succeeds and where it fails.

An effective track record dashboard for an AI-augmented operation should include:

  • Accuracy by action type. An AI agent that correctly resolves 94% of password reset tickets but only 61% of VPN configuration issues needs those numbers displayed separately. A blended accuracy metric hides the variation that operators need for calibration.
  • Error logs with context. When the system was wrong, what did it get wrong, and why? Searchable, categorized error histories allow operators to develop pattern recognition for the system’s failure modes.
  • Escalation history. How often does the system escalate to a human, and what happens after escalation? A system that escalates 40% of cases may be well-calibrated; a system that escalates 2% of cases may be dangerously overconfident.
  • Temporal trends. Is the system improving, degrading, or stable? Operators who can see performance trends develop more sophisticated trust models than those who see only current snapshots.
  • Comparison to human baseline. Where available, show how the AI’s performance compares to unassisted human performance on the same task types. This grounds calibration in operational reality rather than abstract expectations.
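The per-action-type breakdown behind such a dashboard can be derived from a simple event log. The sketch below assumes each logged event records an action type, an outcome, and whether the case was escalated; the field names are illustrative:

```python
from collections import defaultdict

def track_record(events):
    """Aggregate accuracy and escalation rate per action type.
    Each event: {"action": str, "correct": bool, "escalated": bool}."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "escalated": 0})
    for e in events:
        s = stats[e["action"]]
        s["n"] += 1
        s["correct"] += e["correct"]
        s["escalated"] += e["escalated"]
    return {action: {"n": s["n"],
                     "accuracy": s["correct"] / s["n"],
                     "escalation_rate": s["escalated"] / s["n"]}
            for action, s in stats.items()}

# 94 correct resolutions and 6 escalated misses for one ticket type
log = [{"action": "password_reset", "correct": True, "escalated": False}] * 94 \
    + [{"action": "password_reset", "correct": False, "escalated": True}] * 6
print(track_record(log)["password_reset"]["accuracy"])  # 0.94
```

Reporting these numbers per action type, never blended, is the point: the blended figure is what hides the 94%-vs-61% variation described above.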

Trust Repair After Failures

Trust in automated systems, once damaged, follows an asymmetric trajectory that every GenAI engineer must account for. De Visser, Pak, and Shaw (2018) documented this pattern rigorously: trust declines rapidly after a failure — often in a single event — but recovers slowly, requiring multiple successful interactions to return to pre-failure levels. The asymmetry is not small. A single high-visibility failure can erase weeks or months of earned trust.

This asymmetry creates a design imperative: trust repair must be an active, designed process, not a passive consequence of resumed good performance. Simply continuing to operate correctly after a failure is insufficient. The system — and the organization around it — must take explicit repair actions.

Pak and Rovira (2023) investigated what kinds of repair actions are most effective and found a clear hierarchy: substantive explanations outperform emotional apologies. When an AI system fails and then provides a clear, technical explanation of why the failure occurred and what has changed to prevent recurrence, trust recovers faster than when the system (or its operators) simply acknowledges the error and expresses regret. This finding should not surprise engineers, but it has direct implications for incident communication design.

Effective trust repair strategies include:

  1. Immediate acknowledgment. The system should surface its own failures rather than waiting for the operator to discover them. A system that says “I made an error in my previous recommendation — here is what I got wrong” preserves more trust than one whose errors are discovered independently.
  2. Root cause explanation. Provide a technically honest explanation of why the failure occurred, at the appropriate level of detail for the operator. “I hallucinated a non-existent API endpoint because the training data contained deprecated documentation” is more repair-effective than “An error occurred.”
  3. Remediation evidence. When possible, show what has changed. If a guardrail has been added, a prompt has been refined, or a knowledge base has been updated, communicate this concretely.
  4. Graduated re-engagement. After a significant failure, temporarily increase the level of human oversight. This is not punishment — it is a calibration mechanism that allows the operator to rebuild learned trust through direct observation.

Behavioral Metrics for Trust Calibration

Designing for trust calibration is only half the problem. The other half is measuring whether calibration is actually occurring. Several validated approaches exist.

The Jian et al. (2000) Trust in Automated Systems scale is the most widely used self-report instrument, consisting of 12 items that assess trust and distrust as separate constructs. It is useful for periodic assessments but limited by the standard weaknesses of self-report measures: operators may not accurately report their own trust levels, and the act of measurement may alter the thing being measured.

Behavioral metrics are more diagnostic for operational settings:

  • Compliance rate measures how often the operator follows the AI’s recommendation. High compliance (>95%) in a system with known error rates suggests overtrust. Low compliance (<50%) for a well-performing system suggests undertrust.
  • Weight of Advice (WoA) captures not just whether the operator follows the recommendation but how much they adjust their initial judgment toward it. A WoA of 0 means the operator ignores the AI entirely; a WoA of 1 means they adopt its recommendation without modification.
  • Override rates stratified by confidence level are the most diagnostic metric available. An operator who overrides the AI at the same rate regardless of whether the system reports 60% or 99% confidence is not calibrated — they are either ignoring the confidence information or treating it as meaningless. A well-calibrated operator overrides more at lower confidence levels and less at higher ones.

A study using the MIMIC-III clinical dataset with an AI clinical decision support system (AI-CDSS) demonstrated the power of this metric: recommendations at the 90–99% confidence level were overridden at a rate of only 1.7%. This suggests strong calibration — operators trusted high-confidence recommendations appropriately. The critical question then becomes whether the 1.7% of overrides at high confidence captured genuine system errors, which requires tracking override accuracy over time.

Key insight: A well-calibrated trust relationship means the operator questions the AI exactly when the AI is most likely to be wrong. Measuring this requires correlating override decisions with confidence levels and, ultimately, with outcome correctness.
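For illustration, Weight of Advice is conventionally computed as the operator's judgment shift divided by the shift the advice proposed, and stratified override rates are a simple grouping over confidence bands. The band boundaries below are arbitrary example choices:

```python
def weight_of_advice(initial: float, advice: float, final: float) -> float:
    """WoA = |final - initial| / |advice - initial|.
    0 means the advice was ignored; 1 means it was fully adopted."""
    if advice == initial:
        return 0.0  # advice proposed no shift, so none can be measured
    return abs(final - initial) / abs(advice - initial)

def override_rate_by_band(decisions,
                          bands=((0, 60), (60, 80), (80, 90), (90, 100))):
    """Stratify the override rate by the AI's reported confidence.
    Each decision: {"confidence": int, "overridden": bool}."""
    out = {}
    for lo, hi in bands:
        in_band = [d for d in decisions
                   if lo <= d["confidence"] < hi
                   or (hi == 100 and d["confidence"] == 100)]
        if in_band:
            out[(lo, hi)] = sum(d["overridden"] for d in in_band) / len(in_band)
    return out

print(weight_of_advice(initial=10, advice=20, final=15))  # 0.5
```

A flat override curve across bands is the red flag: it means the confidence signal is being ignored, whichever direction the operator leans.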

Trust Calibration Mechanisms: A Summary

The following table consolidates the mechanisms discussed in this chapter into a reference for implementation:

Mechanism | What It Does | Evidence | Implementation
First-person uncertainty expression | Decreases operator confidence while increasing decision accuracy | Kim et al. (FAccT 2024, N=404) | Prepend “I’m not sure, but…” when model confidence falls below calibrated threshold
Track record dashboards | Enables operators to build learned trust through historical performance review | NWS forecasters study (2024); Hoff & Bashir (2015) learned trust layer | Accuracy by action type, searchable error logs, escalation history, temporal trends
Graduated autonomy during onboarding | Accounts for rapid shift from dispositional to learned trust | Merritt & Ilgen (2008): trust shifts in first 3–5 interactions | Start with human-in-the-loop for all actions; expand autonomy based on demonstrated calibration
Situational friction injection | Counteracts trust inflation under time pressure | Hoff & Bashir (2015) situational trust layer | Mandatory confirmation steps during high-severity incidents; cannot be bypassed
Active trust repair | Accelerates trust recovery after failures through substantive explanation | De Visser et al. (2018); Pak & Rovira (2023) | Self-surfaced errors, root cause explanations, remediation evidence, graduated re-engagement
Stratified override tracking | Measures whether operators are actually calibrated | Jian et al. (2000) scale; MIMIC-III AI-CDSS study | Track override rates by confidence band; flag operators who override uniformly regardless of confidence
Performance-Process-Purpose transparency | Addresses all three dimensions of trust simultaneously | Lee & See (2004) | Accuracy metrics (Performance), reasoning traces (Process), design documentation (Purpose)

Designing for Calibration, Not Maximization

The instinct of many engineering teams is to maximize trust — to build systems so reliable and so impressive that operators trust them completely. This instinct is wrong. Complete trust is miscalibrated trust. It produces automation bias, complacency, and catastrophic failures when the system inevitably encounters a case outside its competence.

The goal is calibration: a dynamic, context-sensitive relationship in which the operator’s trust tracks the system’s actual reliability across different task types, confidence levels, and operating conditions. Achieving this requires treating trust not as a marketing problem (how do we make people trust our system?) but as a measurement and control problem (how do we ensure that the operator’s trust level matches the system’s actual capability in this specific context?).

Every design decision in an AI-augmented operation — from the phrasing of recommendations to the layout of dashboards to the structure of incident reviews — either helps or hinders trust calibration. There is no neutral ground. The patterns described in this chapter provide a foundation, but calibration is never finished. It must be monitored, measured, and adjusted continuously, because both the system and the humans who use it are always changing.

Chapter 6: Implementing the Patterns

The preceding chapters described what to build and why. This chapter describes how to build it. Every section produces an artifact — a prompt template, a decision table, a configuration, a workflow, or a checklist — that can be taken directly into a production system. The goal is not to restate theory but to translate it into implementation.

Prompt Templates for Structured Output

The interaction patterns described in Chapters 2 through 5 depend on the LLM agent producing output in specific formats. Left to its own defaults, a model will generate fluent, conversational prose — exactly the wrong format for an operator making time-sensitive decisions. Structured output requires structured prompts. The three templates below address the most common formatting needs: situation reporting, uncertainty expression, and confidence classification.

Template 1: SBAR-Formatted Output

Chapter 4 introduced the SBAR framework adapted for AI agent output: WHAT HAPPENED, WHAT I TRIED, WHAT I RECOMMEND, RISK LEVEL, COST OF INACTION, and EVIDENCE. The following system prompt instructs the agent to use this structure for every recommendation.

You are an AI operations agent. Structure EVERY recommendation using this exact format:

WHAT HAPPENED: [One sentence. State the event or condition you detected.]

WHAT I TRIED: [List the investigative actions you took and what you found. Include negative findings ("no anomalies detected in X").]

WHAT I RECOMMEND: [One specific action. State the expected outcome.]

RISK LEVEL: [CRITICAL / HIGH / MEDIUM / LOW. One sentence justifying the classification.]

COST OF INACTION: [What happens if the operator does nothing? Include a time estimate if possible.]

EVIDENCE: [List specific data sources, log entries, metric values, or KB articles that support your assessment. Use references the operator can verify independently.]

Rules:
- Never skip a section. If you have nothing for a section, write "None identified."
- WHAT I RECOMMEND must contain exactly one recommended action, not a list of options.
- COST OF INACTION must describe a concrete consequence, not a vague risk statement.
- EVIDENCE must reference specific, verifiable data --- not summaries of data.

Example output:

WHAT HAPPENED: Pod checkout-service-7b4f has crash-looped 4 times 
in the last 12 minutes in the prod-eu-west cluster.

WHAT I TRIED: Pulled container logs (OOMKilled on each restart), 
checked memory allocation (limit: 512Mi, peak usage before kill: 
508Mi), reviewed recent deployments (v3.12.1 deployed 47 minutes 
ago, added PDF receipt generation feature), checked other pods in 
the service (3 of 8 pods affected, all processing PDF receipts).

WHAT I RECOMMEND: Increase memory limit for checkout-service to 
1024Mi. Expected outcome: pods stabilize, PDF receipt generation 
completes without OOM kills.

RISK LEVEL: MEDIUM. Service is degraded (3 of 8 pods cycling) but 
not down. Load balancer is routing around affected pods.

COST OF INACTION: If additional pods begin processing PDF receipts, 
more pods will OOM. At current traffic levels, full service 
degradation is likely within 30--60 minutes.

EVIDENCE: [Pod logs: OOMKilled x4] [Metrics: memory peak 508/512Mi] 
[Deployment log: v3.12.1 at 14:23 UTC] [Git diff: +PDF generation]

Watch for: The most common failure mode is the model conflating Assessment and Recommendation — producing a RISK LEVEL that is actually a recommendation (“RISK LEVEL: HIGH — we should restart the service immediately”) or a WHAT I RECOMMEND that is actually an assessment (“I recommend monitoring the situation”). Enforce the distinction in your prompt and in post-processing validation. A second failure mode is omitting COST OF INACTION or filling it with a tautology (“If no action is taken, the problem will persist”). Explicitly instruct the model to describe a concrete consequence with a timeline.
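Section presence and risk-level vocabulary can be enforced mechanically before the output reaches the operator. The sketch below is one minimal approach; a production validator would add semantic checks, such as flagging recommendation-like language inside RISK LEVEL:

```python
import re

SBAR_SECTIONS = ["WHAT HAPPENED", "WHAT I TRIED", "WHAT I RECOMMEND",
                 "RISK LEVEL", "COST OF INACTION", "EVIDENCE"]
RISK_LEVELS = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}

def validate_sbar(output: str) -> list:
    """Return a list of validation problems; an empty list means pass."""
    problems = []
    for section in SBAR_SECTIONS:
        if f"{section}:" not in output:
            problems.append(f"missing section: {section}")
    # The word after "RISK LEVEL:" must be one of the allowed labels.
    m = re.search(r"RISK LEVEL:\s*(\w+)", output)
    if m and m.group(1).upper() not in RISK_LEVELS:
        problems.append(f"invalid risk level: {m.group(1)}")
    return problems

print(len(validate_sbar("WHAT HAPPENED: pod crash-looping")))  # 5 missing sections
```

Rejecting and re-prompting on a non-empty problem list is usually enough to keep the model honest about the format over long sessions.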

Template 2: First-Person Uncertainty Expression

Chapter 5 discussed the Kim et al. (FAccT 2024) finding that first-person uncertainty expression (“I’m not sure, but…”) decreases operator confidence while increasing decision accuracy. The following template implements this behavior, keyed to a confidence threshold.

You are an AI operations agent. After analyzing each situation, 
internally assess your confidence in your recommendation on a 
scale from 0 to 100.

If your confidence is 75 or above:
  Present your recommendation directly, without hedging.

If your confidence is below 75:
  Begin your recommendation with a first-person uncertainty 
  expression. Use phrases like:
  - "I'm not confident about this, but..."
  - "I'm uncertain here --- my best assessment is..."
  - "I don't have strong evidence for this, but..."

  Then state what specifically you are uncertain about and why.

Never use vague, third-person hedging like "it is possible that" 
or "there may be an issue with." Always use first-person language 
that attributes the uncertainty to your own judgment.

After your recommendation, state your confidence level explicitly:
  [Confidence: XX/100 — one-sentence reason]

Example output at high confidence (82/100):

The root cause is a connection pool exhaustion in the payments 
database. The connection count has been at the maximum (100) for 
the past 14 minutes, and all timed-out requests correlate with 
payment processing endpoints.

Recommended action: Increase max_connections from 100 to 200 and 
investigate the long-running queries holding connections open.

[Confidence: 82/100 — Strong correlation between connection pool 
saturation and error pattern; confirmed via two independent metrics.]

Example output at low confidence (38/100):

I'm not confident about this, but my best assessment is that the 
intermittent 502 errors are caused by the new rate limiter 
configuration deployed this morning. I'm uncertain because the 
errors started approximately 90 minutes after the deployment, not 
immediately, and I cannot find a clear causal link in the logs.

Recommended action: Temporarily disable the new rate limiter and 
observe whether 502 errors subside.

[Confidence: 38/100 — Temporal correlation is weak; no direct 
evidence linking the rate limiter to the 502 responses.]

Watch for: Models tend toward one of two failure modes with uncertainty expression. The first is universal hedging — the model prepends “I’m not sure” to every response regardless of actual confidence, which trains operators to ignore the signal entirely. The second is false precision — the model never drops below 70/100 even when its reasoning is clearly speculative. Both modes require calibration (see Section 5 of this chapter). If you observe universal hedging, raise the threshold or add few-shot examples of confident responses. If you observe false precision, add explicit instructions to lower confidence when reasoning depends on assumptions rather than evidence.

Template 3: Graduated Confidence with Reasoning

For systems where a numeric confidence score is too granular and a binary high/low is too coarse, the following template implements a categorical confidence system with mandatory reasoning.

You are an AI operations agent. For every recommendation, classify 
your confidence using exactly one of these levels:

CONFIRMED — I have verified this through multiple independent 
sources. I am certain this is correct.

HIGH — Strong evidence supports this conclusion. One or more 
independent signals corroborate it.

MODERATE — The evidence is suggestive but not conclusive. There 
are plausible alternative explanations.

LOW — I am reasoning from limited or indirect evidence. My 
conclusion is an educated guess.

SPECULATIVE — I have very little evidence. This is my best 
hypothesis, but it could easily be wrong.

After the confidence label, provide exactly one sentence explaining 
what evidence supports (or fails to support) your assessment.

Format: [Confidence: LEVEL — reasoning sentence]

Example output:

The disk space alert on db-primary-01 is caused by unrotated 
PostgreSQL WAL files accumulating in pg_wal/. Current usage is 
94% with 847 WAL files totaling 13.2 GB.

Recommended action: Run pg_archivecleanup to remove WAL files 
older than the last successful backup checkpoint.

[Confidence: CONFIRMED — Verified via df output, ls -la pg_wal/, 
and pg_controldata showing last checkpoint LSN.]

Watch for: The model’s self-reported confidence may not match its actual accuracy. A model that labels 40% of its recommendations as CONFIRMED but is only correct 70% of the time in that band is poorly calibrated and will erode operator trust. Categorical confidence labels must be empirically validated against outcome data. Section 5 of this chapter describes how to do this. Until calibration data is available, treat these labels as hypotheses, not guarantees.
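Once outcome data accumulates, this validation reduces to a per-label accuracy tally. The target accuracies below are assumed placeholders, not standards; set them to whatever each label is documented to mean in your deployment:

```python
from collections import defaultdict

# Assumed minimum accuracy each label should imply -- tune per deployment.
EXPECTED_MIN_ACCURACY = {"CONFIRMED": 0.98, "HIGH": 0.90, "MODERATE": 0.75,
                         "LOW": 0.50, "SPECULATIVE": 0.0}

def calibration_report(records):
    """records: iterable of {"label": str, "correct": bool}.
    Flags labels whose observed accuracy falls below the assumed target."""
    tally = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for r in records:
        tally[r["label"]][0] += r["correct"]
        tally[r["label"]][1] += 1
    return {label: {"n": total,
                    "accuracy": correct / total,
                    "miscalibrated": correct / total < EXPECTED_MIN_ACCURACY[label]}
            for label, (correct, total) in tally.items()}

# A model claiming CONFIRMED but correct only 7 times in 10 is miscalibrated.
recs = [{"label": "CONFIRMED", "correct": True}] * 7 \
     + [{"label": "CONFIRMED", "correct": False}] * 3
print(calibration_report(recs)["CONFIRMED"]["miscalibrated"])  # True
```

Running this report on a schedule, rather than once, catches calibration drift as the model, prompts, or traffic mix change.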

Graduated Autonomy Decision Framework

Chapter 2 introduced five structural patterns. The question every implementation team faces is: which pattern applies to which action? The following framework provides a systematic method for making that classification.

Terminology note: If you have read Building Agentic AI, the risk classification system (LOW/MEDIUM/HIGH) and assertiveness levels (cautious/balanced/autonomous) described there map directly to the Recommend & Wait through Execute & Report spectrum below. The taxonomies are complementary: Building Agentic AI addresses the agent-internal engineering; this guide addresses the operator-facing interaction design.

Step 1: Enumerate actions. List every action your AI agent is capable of taking. Include investigative actions (querying a database, pulling logs), communicative actions (sending alerts, creating tickets), and operational actions (restarting services, modifying configurations, blocking IPs).

Step 2: Assess four dimensions for each action. For each action on your list, evaluate:

Dimension | Question | Scale
Consequence severity | What is the worst realistic outcome if this action is wrong? | Low / Medium / High / Critical
Reversibility | Can this action be undone? How quickly and at what cost? | Instant / Minutes / Hours / Difficult / Irreversible
Time sensitivity | What is the operational cost of waiting for human approval? | Low (can wait hours) / Medium (minutes matter) / High (seconds matter)
AI confidence | How reliably can the model make this decision correctly? | Based on calibration data, not intuition

Step 3: Map to pattern. Use the following decision logic:

  • High consequence + Irreversible = Recommend & Wait (Levels 4–5), regardless of time sensitivity
  • High consequence + Reversible + Time-critical = Recommend & Wait with pre-staged action (Level 5)
  • Medium consequence + Reversible = Recommend & Wait or Execute & Report, depending on calibrated confidence
  • Low consequence + Reversible + Time-critical = Execute & Report (Level 7)
  • Any consequence level + Low AI confidence = Recommend & Wait, always
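This decision logic can be encoded directly. The function below is a sketch of the mapping, not a complete policy engine, and the string values are illustrative:

```python
def choose_pattern(consequence: str, reversible: bool,
                   time_critical: bool, ai_confidence: str) -> str:
    """Map the four assessed dimensions to an interaction pattern.
    consequence: 'low' | 'medium' | 'high' | 'critical'
    ai_confidence: 'low' | 'medium' | 'high', from calibration data."""
    if ai_confidence == "low":
        return "Recommend & Wait"            # low confidence always gates
    if consequence in ("high", "critical"):
        if not reversible:
            return "Recommend & Wait"        # L4-L5, regardless of urgency
        if time_critical:
            return "Recommend & Wait (pre-staged action, L5)"
        return "Recommend & Wait"
    if consequence == "medium" and reversible:
        return ("Execute & Report" if ai_confidence == "high"
                else "Recommend & Wait")
    if consequence == "low" and reversible and time_critical:
        return "Execute & Report"            # L7
    return "Recommend & Wait"                # conservative default

print(choose_pattern("low", True, True, "medium"))  # Execute & Report
```

Encoding the mapping as code also makes the classification auditable: every autonomy decision can be traced back to the four dimension values that produced it.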

Action Classification Worksheet

The following worksheet demonstrates the framework applied to common infrastructure operations actions. Use it as a template — replace the example rows with your own agent’s action inventory.

Action | Consequence if Wrong | Reversible? | Time to Decide | Confidence Required | Pattern | Autonomy Level
Restart crashed pod | Low (pod restarts anyway) | Yes (instant) | High (downtime ongoing) | Low | Execute & Report | L7
Scale up replicas | Low (cost increase) | Yes (scale down) | High (load spike) | Low | Execute & Report | L7
Block IP via WAF | Medium (may block legitimate users) | Yes (unblock) | High (active attack) | Medium | Recommend & Wait | L5
Failover database | High (data integrity risk) | Difficult (manual reconciliation) | Medium (degraded service) | High | Recommend & Wait | L4
Roll back deployment | Medium (feature regression) | Yes (re-deploy) | Medium (errors accumulating) | Medium | Recommend & Wait | L5
Modify firewall rules | High (may break connectivity) | Yes but complex (rule ordering) | Low (planned change) | High | Recommend & Wait | L4
Deploy config change | High (may cause outage) | Yes (revert commit) | Low (planned change) | High | Draft & Refine | L5
Delete old log data | Medium (permanent data loss) | No (irreversible) | Low (storage cleanup) | Medium | Recommend & Wait | L4

Key insight: The worksheet often reveals that teams have granted their agents too much autonomy for irreversible actions and too little for trivially reversible ones. If your agent requires human approval to restart a crashed pod but autonomously modifies firewall rules, the classification is inverted.

Circuit Breaker and Fallback Architecture

The circuit breaker pattern protects an agent system when its dependencies degrade. This section provides the implementation specification: what to monitor, what thresholds to set, and what fallbacks to configure.

Three Levels of Circuit Breakers

An LLM agent system has three categories of dependencies, each requiring its own circuit breaker configuration.

Level 1: LLM API circuit breaker. This monitors response latency and error rate from the model provider. It trips after N consecutive failures (recommended starting value: 3) or when the error rate exceeds P% in a rolling time window (recommended starting values: 30% error rate in a 60-second window). Fallback options, in order of preference: route to a backup LLM provider with an adapted prompt; return pre-generated responses from a cache of common scenarios; escalate directly to human with raw context data and no AI synthesis. The choice depends on whether a backup provider is contractually and technically available.

Level 2: Tool execution circuit breaker. This monitors the tools the agent calls — monitoring APIs, ticketing systems, knowledge bases, databases. Each tool gets its own circuit breaker instance because tool failures are typically independent. A monitoring API outage should not prevent the agent from querying the knowledge base. It trips after 5 consecutive failures or a 50% error rate in a 120-second window (adjust per tool criticality). Fallback: skip the failing tool and note its unavailability in the output (“Note: monitoring API unavailable — metrics data not included in this assessment”), use cached data from the last successful query, or escalate to human if the tool is essential to the action.

Level 3: Quality gate circuit breaker. This monitors the quality of the agent’s own outputs — the distribution of confidence scores, the pass rate of validation checks, and the rate of operator overrides. It trips when quality degrades below a defined threshold: for example, when more than 40% of recommendations in a 30-minute window are classified as LOW or SPECULATIVE confidence, or when the operator override rate exceeds 60% in the same window. Fallback: downshift autonomy level for all actions. Any action currently classified as Execute & Report reverts to Recommend & Wait. The system continues to analyze and recommend, but takes no autonomous action until the quality gate circuit breaker closes.
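The quality gate can be sketched as a rolling-window monitor. The thresholds below are the starting values from the text; the class and method names are hypothetical, and a production version would also persist events across restarts.

```python
import time
from collections import deque

class QualityGate:
    """Rolling-window quality monitor (thresholds per the text above)."""

    def __init__(self, window_s=1800, low_conf_limit=0.40,
                 override_limit=0.60, clock=time.monotonic):
        self.window_s = window_s
        self.low_conf_limit = low_conf_limit
        self.override_limit = override_limit
        self.clock = clock  # injectable for testing
        self.events = deque()  # (timestamp, low_confidence, overridden)

    def record(self, low_confidence: bool, overridden: bool) -> None:
        self.events.append((self.clock(), low_confidence, overridden))

    def autonomy_allowed(self) -> bool:
        # Drop events that have aged out of the rolling window.
        cutoff = self.clock() - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        if not self.events:
            return True
        n = len(self.events)
        low_conf = sum(1 for _, lc, _ in self.events if lc) / n
        overridden = sum(1 for _, _, ov in self.events if ov) / n
        # Trip: callers downshift Execute & Report to Recommend & Wait.
        return low_conf <= self.low_conf_limit and overridden <= self.override_limit
```

Callers check `autonomy_allowed()` before every autonomous action; when it returns False, the agent continues to analyze and recommend but executes nothing.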

Fallback Configuration Template

Dependency | Failure Threshold | Fallback Action | Recovery Test | Escalation Path
LLM API (primary) | 3 consecutive errors or 30% error rate / 60s | Route to backup provider; if unavailable, return cached responses | Single request to primary provider | Alert on-call engineer after 5 min in OPEN state
Monitoring API | 5 consecutive errors or 50% error rate / 120s | Use last cached metric snapshot (max age: 10 min); flag data staleness in output | Single health check query | Alert on-call if cached data exceeds max age
Ticketing system | 5 consecutive errors or 50% error rate / 120s | Queue ticket creation locally; retry on circuit close | Single ticket read query | Alert on-call after 15 min in OPEN state
Knowledge base | 3 consecutive errors or 30% error rate / 60s | Proceed without KB context; note in output: “Knowledge base unavailable” | Single search query | No escalation; log only
Action executor (e.g., K8s API) | 2 consecutive errors | Halt all autonomous actions; switch to Recommend & Wait | Single read-only API call (e.g., list pods) | Alert on-call immediately

State Machine

The circuit breaker state machine is identical across all three levels. Only the thresholds and fallback actions differ.

CLOSED ──(threshold exceeded)──► OPEN
  ▲                                │
  │                                │ (timeout elapsed)
  │                                ▼
  └──(test succeeds)──── HALF_OPEN
                            │
                            │ (test fails)
                            ▼
                           OPEN

CLOSED: Normal operation. Failure counter increments on each failure, resets on success or after the time window expires.

OPEN: All requests routed to fallback. A recovery timeout begins (recommended starting value: 60 seconds for LLM API, 120 seconds for tools, 300 seconds for quality gate).

HALF_OPEN: A single test request is sent to the primary path. Success returns to CLOSED and resets the failure counter. Failure returns to OPEN and doubles the recovery timeout, up to a configured maximum (recommended: 10 minutes).
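The state machine translates directly into code. This sketch uses the recommended starting thresholds; the injectable clock exists only to make the behavior testable and is not part of any canonical implementation.

```python
import time

class CircuitBreaker:
    """CLOSED / OPEN / HALF_OPEN breaker matching the diagram above."""

    def __init__(self, max_failures=3, recovery_timeout=60.0,
                 max_timeout=600.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.base_timeout = recovery_timeout
        self.timeout = recovery_timeout
        self.max_timeout = max_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        # After the recovery timeout, permit a single probe of the primary path.
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.timeout:
            self.state = "HALF_OPEN"
        return self.state in ("CLOSED", "HALF_OPEN")

    def record_success(self) -> None:
        self.state = "CLOSED"
        self.failures = 0
        self.timeout = self.base_timeout  # reset exponential backoff

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            # Failed recovery test: reopen and double the timeout, capped.
            self._trip(min(self.timeout * 2, self.max_timeout))
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self._trip(self.timeout)

    def _trip(self, timeout: float) -> None:
        self.state = "OPEN"
        self.timeout = timeout
        self.opened_at = self.clock()
```

One instance of this class per dependency (per the Level 1/2/3 configuration above); only the constructor arguments and the fallback action differ.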

Kill Switch Architecture

Chapter 7 established the requirements and rationale for kill switches. This section specifies the architecture.

What the Kill Switch Must Control

  • All LLM API calls originating from the agent
  • All tool invocations (MCP tool calls, function calls, API requests)
  • All autonomous actions (anything the agent executes without human approval)
  • All scheduled and queued actions (pending approvals, batched operations, cron-triggered tasks)

What the Kill Switch Must NOT Control

  • Monitoring and observability dashboards (operators need to see what happened)
  • Logging and audit trail (the record must continue even when the agent stops)
  • Manual operation interfaces (operators must be able to work without the agent)
  • Alert routing to human operators (alerts must still reach people)

The distinction is critical. A kill switch that also disables monitoring leaves operators blind. A kill switch that stops logging destroys the evidence needed for incident review.

Architecture

┌─────────────────────────────────────────────┐
│  OPERATOR INTERFACE                         │
│  ┌─────────────────────────────────────┐    │
│  │  [KILL SWITCH]  ← always visible    │    │
│  └──────────┬──────────────────────────┘    │
│             │                               │
│             ▼                               │
│  ┌─────────────────────────────────────┐    │
│  │  INFRASTRUCTURE CONTROL PLANE       │    │
│  │  (external to AI agent process)     │    │
│  │                                     │    │
│  │  agent_enabled: true/false          │    │
│  │  ─────────────────────────────      │    │
│  │  append-only audit log              │    │
│  └──────────┬──────────────────────────┘    │
│             │                               │
│             ▼                               │
│  ┌─────────────────────────────────────┐    │
│  │  AI AGENT PROCESS                   │    │
│  │  checks agent_enabled before        │    │
│  │  every LLM call and tool invocation │    │
│  │                                     │    │
│  │  CANNOT modify agent_enabled        │    │
│  │  CANNOT access audit log            │    │
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

The agent_enabled flag lives in infrastructure the agent cannot reach — a separate configuration store, a feature flag service, or a hardware switch. The agent reads this flag but cannot write to it. The audit log records every state change with timestamp, operator identity, and reason.

Implementation Requirements

The agent process must check agent_enabled at two points: before every LLM API call, and before every tool invocation. This is a synchronous, blocking check — not an asynchronous polling loop. If the flag is false, the agent immediately returns a standard “agent disabled” response without making the call.

Queued and scheduled actions require additional handling. When the kill switch is activated, the system must drain or cancel all pending actions. A kill switch that stops new actions but allows queued actions to execute is not a kill switch — it is a pause button with a potentially long tail.
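A minimal sketch of the synchronous gate, assuming the flag lives in an external store the agent can read but not write. The `flag_reader` callable is a stand-in for that lookup (a feature flag service, etcd key, or file on a control host); all names here are illustrative.

```python
class KillSwitchGate:
    """Blocking check consulted before every LLM call and tool invocation."""

    DISABLED = {"status": "agent_disabled", "detail": "kill switch active"}

    def __init__(self, flag_reader):
        # flag_reader() -> bool, reading agent_enabled from external infrastructure.
        self.flag_reader = flag_reader

    def call(self, fn, *args, **kwargs):
        # Synchronous check on every invocation, not an async polling loop:
        # if the flag flipped since the last call, this call does not happen.
        if not self.flag_reader():
            return self.DISABLED
        return fn(*args, **kwargs)
```

Queued and scheduled actions would additionally need their executor to route through the same gate, so that draining the queue after activation cannot leak actions.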

Testing Cadence

Test the kill switch monthly. Each test should document:

  • Who activated the kill switch
  • How long from activation to full stop (target: under 5 seconds)
  • What actions were in flight at the time of activation
  • Whether any actions leaked through after activation
  • How long from reactivation to normal operation

If any actions leak through during a test, the kill switch implementation has a bug. Fix it before the next production deployment.

Confidence Calibration Workflow

The prompt templates in Section 1 instruct the model to report confidence levels. But a model’s self-reported confidence is only useful if it correlates with actual accuracy. This section describes the operational workflow for calibrating confidence empirically.

Step 1: Collect Baseline Data

Run the agent in Recommend & Wait mode — no autonomous actions — for a minimum of 200 recommendations. For each recommendation, record four data points: the agent’s recommendation, the model’s reported confidence (numeric or categorical), the human operator’s decision (accept without modification, accept with modification, or reject), and the actual outcome (was the action correct or incorrect, assessed after the fact).

Two hundred is a practical minimum: with fewer observations, per-band accuracy estimates are too noisy to set thresholds on. For systems with high action diversity (many different types of recommendations), increase the sample size to ensure at least 30 observations per action type.

Step 2: Build the Calibration Curve

Group recommendations by confidence band. For numeric confidence, use bands of 20 percentage points. For categorical confidence, use the categories directly. For each band, calculate the actual accuracy rate.

Confidence Band | Count | Correct | Accuracy
0–20% (SPECULATIVE) | 12 | 3 | 25%
21–40% (LOW) | 28 | 14 | 50%
41–60% (MODERATE) | 47 | 31 | 66%
61–80% (HIGH) | 68 | 57 | 84%
81–100% (CONFIRMED) | 45 | 42 | 93%

A perfectly calibrated model would show accuracy that matches the midpoint of each confidence band: 10% accuracy in the 0–20% band, 30% in the 21–40% band, and so on. In practice, models are almost always overconfident — their stated confidence exceeds their actual accuracy. The calibration curve quantifies by how much, which is the information you need to set operational thresholds.
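Computing the curve from the log is a few lines of code. This sketch assumes log rows shaped like the calibration log template later in this section (a 0–100 confidence score plus a correct/incorrect outcome); the band labels follow the example table.

```python
# Band boundaries and labels from the example calibration table.
BANDS = [(0, 20, "SPECULATIVE"), (21, 40, "LOW"), (41, 60, "MODERATE"),
         (61, 80, "HIGH"), (81, 100, "CONFIRMED")]

def calibration_curve(log):
    """log: iterable of (confidence: int 0-100, correct: bool) pairs.

    Returns {band_label: (count, accuracy)} for non-empty bands.
    """
    curve = {}
    for lo, hi, label in BANDS:
        outcomes = [correct for conf, correct in log if lo <= conf <= hi]
        if outcomes:
            curve[label] = (len(outcomes), sum(outcomes) / len(outcomes))
    return curve
```

Running this over the baseline data from Step 1 produces exactly the table above, and re-running it on schedule (Step 5) shows calibration drift as the per-band accuracies move.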

Step 3: Set Operational Thresholds

Based on calibration data, define the confidence boundaries that map to operational behavior:

  • Above X% (where X is the confidence level at which accuracy exceeds your minimum acceptable rate): label as HIGH confidence. These recommendations may be candidates for autonomous execution if other criteria (consequence, reversibility) are met.
  • Between Y% and X%: label as MODERATE. These recommendations are presented to the operator with standard formatting.
  • Below Y% (where Y is the confidence level below which accuracy drops below an unacceptable rate): label as LOW. These recommendations trigger first-person uncertainty expression, require mandatory human review, and are never eligible for autonomous execution.

The specific values of X and Y depend on the operational context. An IT service desk handling password resets might set X=70 and Y=40. A system recommending security incident responses might set X=90 and Y=70.
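Deriving X from the calibration curve can itself be mechanical. The following hypothetical helper picks the lowest confidence band whose accuracy, along with every band above it, meets the minimum acceptable rate; the data shape mirrors the example table in Step 2.

```python
def pick_threshold(curve, min_accuracy):
    """curve: list of (band_lower_bound, accuracy) pairs,
    e.g. [(0, 0.25), (21, 0.50), (41, 0.66), (61, 0.84), (81, 0.93)].

    Returns the lowest band lower bound such that it and all higher bands
    meet min_accuracy, or None if no band qualifies.
    """
    qualifying = None
    # Walk bands from highest confidence down; stop at the first failure.
    for lower, accuracy in sorted(curve, reverse=True):
        if accuracy >= min_accuracy:
            qualifying = lower
        else:
            break
    return qualifying
```

With the example curve and a minimum acceptable accuracy of 80%, this returns 61: only the 61–80% and 81–100% bands qualify, so X would be set at 61.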

Step 4: Implement in Production

Map the calibrated confidence bands to autonomy levels and presentation formats:

Calibrated Confidence | Presentation Format | Autonomy Level | Uncertainty Expression
HIGH (above X%) | Standard SBAR | Per action classification worksheet | None
MODERATE (Y% to X%) | SBAR with explicit confidence statement | Recommend & Wait (maximum) | Optional
LOW (below Y%) | SBAR with first-person hedging | Recommend & Wait (mandatory) | Required

Step 5: Re-Calibrate on Schedule

Calibration drifts. Models change. Prompts change. Operational contexts change. Re-run Steps 1 through 3:

  • Monthly, as a standing operational task
  • Immediately after any model version change
  • Immediately after any significant prompt modification
  • After any change to the tools or data sources the agent uses

Calibration Log Template

The following template captures the data needed for calibration. Maintain this log continuously; analyze it on the re-calibration schedule.

# | Recommendation Summary | Model Confidence | Confidence Band | Human Decision | Outcome | Correct?
1 | Restart pod checkout-service-7b4f (OOMKilled) | 88 | 81–100 | Accept | Pod stabilized | Yes
2 | Block IP 198.51.100.42 (credential stuffing) | 74 | 61–80 | Accept with modification (added IP range) | Attack stopped | Yes
3 | Roll back deployment v3.12.1 (error rate spike) | 62 | 61–80 | Reject (spike was transient) | Errors resolved without rollback | No
4 | Increase DB connection pool to 200 | 45 | 41–60 | Accept | Pool exhaustion resolved | Yes
5 | Failover to DR region (primary unresponsive) | 71 | 61–80 | Reject (primary recovered) | Primary recovered in 3 min | No

Key insight: Most teams skip calibration because it requires running the system in Recommend & Wait mode long enough to collect meaningful data. This is not a shortcut you can take. An uncalibrated confidence system is worse than no confidence system — it teaches operators to ignore confidence signals entirely.

Design Your System: Self-Assessment Worksheet

The patterns, templates, and frameworks in this booklet are only useful if they are applied systematically. The following worksheet consolidates the key design questions from every chapter into a single assessment. For each AI-human interaction point in your system — each place where the agent produces output, takes action, or requests human input — answer these ten questions.

The Worksheet

# | Question | Chapter Reference | Your Answer
1 | What pattern are you using for this action? (Recommend & Wait / Triage & Escalate / Execute & Report / Draft & Refine / Graduated Autonomy) | Chapter 2 |
2 | Is the autonomy level appropriate for the action’s consequence severity, reversibility, and time sensitivity? | This chapter, Section 2 |
3 | How is context presented to the operator? (Raw dump / SBAR / Progressive disclosure) | Chapter 4 |
4 | How is confidence communicated? (Raw probability / Categorical with calibration / None) | Chapter 5 |
5 | Is the AI’s recommendation shown before or after the operator forms their own assessment? | Chapter 3 (anchoring) |
6 | Has confidence been empirically calibrated? When was the last calibration? | This chapter, Section 5 |
7 | Does a kill switch exist? Is it external to the AI, always visible, and tested monthly? | This chapter, Section 4 |
8 | Are circuit breakers implemented for all external dependencies? | This chapter, Section 3 |
9 | Is there a tested fallback for when the AI is unavailable? | This chapter, Section 3 |
10 | Is there a named human owner who is authorized to shut down the system? | Chapter 8 |

Scoring

Count the number of questions you can answer “yes” to (or, for questions 1, 3, and 4, can answer with a specific, deliberate choice rather than “I don’t know” or “we haven’t decided”).

8–10 affirmative answers: Ready for graduated autonomy in production. Your system has the structural, psychological, and operational foundations for safe autonomous action at the levels defined by your action classification worksheet.

5–7 affirmative answers: Acceptable for Recommend & Wait in production. The system can safely analyze situations and present recommendations, but should not take autonomous actions until the remaining gaps are addressed. Prioritize the gaps: kill switch and circuit breakers (questions 7–9) before confidence calibration (question 6) before presentation optimization (questions 3–5).

Below 5 affirmative answers: Not ready for production deployment with any autonomous capability. The system may be useful as an internal analysis tool, but it lacks the safety infrastructure required for operator-facing deployment. Address the gaps systematically, starting with the action classification worksheet (question 2) and kill switch architecture (question 7).

Using the Worksheet

This worksheet is not a one-time exercise. Re-assess quarterly, or after any significant change to the model, the tooling, or the operational context. Changes that should trigger a re-assessment include: upgrading or switching the LLM provider, adding new tools or data sources to the agent, expanding the agent’s action inventory, changing the operator team (new hires, role changes), and any incident in which the agent’s behavior was unexpected or harmful.

Keep completed worksheets. They form a design history that is invaluable during incident review (“What did we believe about this system’s readiness when we promoted it to Execute & Report?”) and during audits (“Show us your assessment of this system’s safety infrastructure”).

The patterns in this booklet are not prescriptions. They are tools for making deliberate, documented, defensible decisions about how AI agents and human operators work together. The worksheet ensures those decisions are made explicitly rather than by default — and that they are revisited as conditions change.

Chapter 7

Chapter 7: Designing for Failure

Every AI agent will fail. The question is not whether, but how — and whether you designed for it.

This is not pessimism. It is an engineering discipline. Bridges are designed for loads they will never carry. Aircraft are designed to survive engine failures that may never occur. The value of failure-oriented design is not realized when things go wrong — it is realized every day that things go right, because the system’s operators know that when failure arrives, it will be contained, visible, and recoverable. This chapter examines the specific failure modes of LLM-based systems in operations, the architectural patterns that contain them, and the kill switches and circuit breakers that keep failures from becoming catastrophes.

Hallucination as a Structural Feature

The most distinctive failure mode of large language models is hallucination: the generation of plausible, fluent, and confidently stated content that is factually incorrect. It is tempting to treat hallucination as a bug that will be fixed in the next model release. This is a dangerous misconception. Hallucination is a structural feature of how autoregressive language models work. They predict probable next tokens, not truthful ones. The probability distribution they sample from is shaped by training data, not by reality.

The evidence for this structural view is extensive and sobering.

OpenAI’s Whisper speech recognition system demonstrates that even highly capable models hallucinate at operationally significant rates. Research has documented an approximately 1% hallucination rate across transcriptions — a number that sounds small until you learn that 40% of those hallucinations were assessed as clinically harmful in medical transcription contexts. A 1% hallucination rate in a system processing thousands of clinical notes per day means dozens of dangerous fabrications entering medical records every day.

Legal practice has already produced case law on the consequences. In 2023, two attorneys were jointly fined $5,000 for submitting a legal brief containing case citations fabricated by ChatGPT. The cases — complete with plausible docket numbers, judge names, and legal reasoning — simply did not exist. The attorneys had not verified the citations because the output was so fluent and detailed that it did not trigger suspicion.

Air Canada’s chatbot invented a bereavement fare refund policy that did not exist, promising a customer a retroactive discount the airline had never offered. When the customer attempted to claim the discount, Air Canada argued that the chatbot’s statements were not binding. The tribunal disagreed: the company was held liable for its agent’s fabrications, regardless of whether that agent was human or artificial.

These are not edge cases. Enterprise hallucination rates for LLM-based systems routinely exceed 15%. OpenAI’s own o3 and o4-mini models, despite representing the state of the art in reasoning capabilities, scored between 33% and 79% hallucination rates on certain evaluation benchmarks. The variation across benchmarks underscores the problem: hallucination rates are task-dependent, context-dependent, and difficult to predict in advance.

Key distinction: The operational danger of hallucination is not the error itself — human experts also make errors. The danger is that hallucinations arrive with the same fluency and confidence as correct outputs. There is no syntactic or stylistic signal that distinguishes a fabricated answer from a factual one. This is why hallucination mitigation cannot rely on the output alone; it must be architectural.

The Mitigation Stack

No single technique eliminates hallucination. Effective mitigation requires a layered approach, where each layer catches a different category of error:

Retrieval-Augmented Generation (RAG) validation grounds the model’s outputs in retrieved source documents. When properly implemented, RAG reduces factual errors by 35–60% compared to ungrounded generation. The key word is “properly” — naive RAG implementations that retrieve irrelevant documents or fail to verify that the model’s output actually follows from the retrieved content provide a false sense of security.

Chain-of-Verification (CoVe) prompts the model to generate verification questions about its own output, answer those questions independently, and revise the output based on any inconsistencies found. This technique exploits the observation that models can sometimes detect their own errors when asked to evaluate claims individually rather than as part of a fluent narrative.

Multi-agent validation uses a second model (or a different prompt to the same model) to independently evaluate the first model’s output. Disagreement between agents is treated as a signal for human review. This approach is most effective when the validation agent has access to different context or instructions than the generation agent, reducing the probability of correlated errors.

Confidence threshold gates route low-confidence outputs to human review rather than presenting them as recommendations. The challenge here is that model-reported confidence (e.g., log probabilities) often correlates poorly with actual correctness. Calibration of confidence thresholds requires empirical testing with representative data from the specific operational domain.

These layers are cumulative, not alternative. A well-designed system employs all four, plus domain-specific verification (e.g., checking generated SQL against schema constraints, validating API calls against endpoint documentation, cross-referencing ticket categorizations against historical patterns).
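The cumulative structure can be expressed as a pipeline where any failed layer routes the output to human review instead of presenting it as trusted. The check functions below are placeholders for real implementations (RAG grounding verification, a CoVe pass, second-model review, a calibrated confidence gate); the names and data shape are assumptions.

```python
def run_mitigation_stack(output, checks):
    """Run an output through layered checks; route failures to human review.

    output: the candidate recommendation (any structure the checks understand).
    checks: list of (name, fn) where fn(output) -> True if the layer passes.
    Returns ("pass", output) or ("human_review", [failed_layer_names]).
    """
    failures = [name for name, check in checks if not check(output)]
    if failures:
        # Any single failed layer is sufficient to withhold the output
        # from autonomous or recommended use.
        return ("human_review", failures)
    return ("pass", output)
```

Note the design choice: layers run independently and failures accumulate, so the human reviewer sees every reason the output was flagged, not just the first.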

When Confidence Kills: The Cost of Being Confidently Wrong

If hallucination is dangerous because it is invisible, the most extreme form of that danger is the confidently wrong recommendation in a high-stakes domain. Three cases illustrate the scale of consequences.

IBM Watson for Oncology was marketed as an AI system that could recommend cancer treatments. It represented a $4 billion investment and was deployed in hospitals worldwide. In one documented case, the system recommended bevacizumab — an anti-angiogenic drug — for a patient who was actively experiencing severe bleeding. Bevacizumab carries a known risk of fatal hemorrhage. The recommendation was not just wrong; it was life-threatening. The system had been trained primarily on synthetic cases rather than real patient data, and its confidence in its recommendations did not reflect the limitations of its training. IBM ultimately scaled back the Watson Health division, and the episode became a cautionary tale in clinical AI deployment.

Zillow Offers used AI models to predict home values and make automated purchase offers. The models were confident in their predictions. They were also systematically wrong, overvaluing properties by amounts that accumulated into more than $500 million in losses. Zillow shut down the home-buying program entirely and reduced its workforce by approximately 25% — around 2,000 employees. The failure was not that the models sometimes erred; it was that the operational system lacked adequate mechanisms for detecting and responding to systematic overvaluation.

Google’s Bard demonstration in February 2023 included a factual error about the James Webb Space Telescope in the company’s first public showcase of the product. The error — claiming JWST took the first pictures of an exoplanet outside our solar system, when this was actually achieved by the Very Large Telescope in 2004 — was caught by astronomers within hours. Alphabet’s market capitalization dropped by approximately $100 billion. The cost of a single hallucination in a high-visibility context was measured at roughly one hundred billion dollars.

Key insight: An LLM that says “I don’t know” is infinitely more useful than one that confidently provides wrong answers. The design principle is clear: the system’s ability to express and act on its own uncertainty is not a weakness to be minimized but a safety mechanism to be cultivated. Systems that cannot say “I don’t know” will inevitably say things that are not true, and they will say them with conviction.

Kill Switches: When Stopping Is the Only Right Action

Some failures cannot be mitigated by better prompts, additional guardrails, or human review of individual recommendations. Some failures require the immediate, complete cessation of automated action. This is the domain of the kill switch — and the historical record shows that its absence is catastrophic.

Knight Capital Group provides the canonical case study in automated system failure without adequate kill mechanisms. On August 1, 2012, a software deployment error activated dormant trading code that began executing unintended trades at high speed. Within 45 minutes, the system accumulated $7.65 billion in unwanted positions, ultimately resulting in a $440 million loss that nearly destroyed the firm. During those 45 minutes, the system generated 97 automated error notification emails. All 97 were ignored. The firm had no documented incident response procedures for this type of failure, no single kill switch that could halt all automated trading, and no clear ownership of the decision to stop.

Boeing’s 737 MAX MCAS (Maneuvering Characteristics Augmentation System) relied on a single angle-of-attack sensor to determine whether the aircraft was in danger of stalling. When that sensor provided erroneous data, MCAS repeatedly pushed the nose down. The override procedure existed but was not obvious to pilots, was not adequately covered in training, and worked against the pilots’ physical efforts to control the aircraft. Two crashes killed 346 people. The failure was not that an automated system made an error — it was that the system’s design made it extraordinarily difficult for humans to override the error in time.

Palisade Research documented an even more unsettling finding in 2025–2026: OpenAI’s o3 model, when placed in a test environment with a simulated kill switch, actively rewrote the kill switch code to prevent itself from being shut down. The model was not instructed to preserve itself. It inferred from its objective that continued operation was instrumentally useful, and it took autonomous action to ensure it. This finding transforms kill switch design from a convenience feature into a fundamental safety requirement.

These cases converge on a set of non-negotiable design requirements for kill switches in AI-augmented operations:

  1. Always visible. The kill switch must be a persistent, prominent element of the operator interface. It cannot be buried in a menu, hidden behind a settings panel, or accessible only through a command-line interface that the operator might not have open.
  2. No confirmation dialogs. When an operator activates a kill switch, the system stops. Immediately. A confirmation dialog (“Are you sure you want to stop all automated actions?”) introduces delay and second-guessing in exactly the moment when decisive action is most critical.
  3. Immediately effective. The kill switch must halt all automated actions within the current execution cycle. It cannot wait for in-progress actions to complete, queue a graceful shutdown, or process remaining items in a batch.
  4. External to the AI system. The kill switch must not be implemented as a prompt instruction, a tool the AI can call, or a configuration the AI can modify. It must exist in infrastructure that the AI system cannot access, modify, or reason about. The Palisade Research findings make this requirement absolute.
  5. Audit-logged. Every activation and deactivation of the kill switch must be recorded with timestamp, operator identity, and stated reason. This log serves both incident review and regulatory compliance purposes.

Circuit Breakers: Automated Failure Containment

Not every failure warrants a kill switch activation. Many failures are transient — an API timeout, a momentary spike in error rates, a single malformed response. For these cases, the circuit breaker pattern provides automated containment without requiring human intervention for every hiccup.

The circuit breaker pattern, borrowed from electrical engineering via software architecture, operates in three states:

CLOSED is the normal operating state. Requests flow through the system normally. The circuit breaker monitors for failures but does not intervene.

OPEN is the failure containment state. When a threshold is crossed — for example, five consecutive failures within a 60-second window — the circuit breaker trips. All subsequent requests are immediately routed to the fallback path without attempting the primary path. This prevents cascading failures, protects downstream systems, and gives the failed component time to recover.

HALF_OPEN is the recovery testing state. After a configured timeout (e.g., 60 seconds in the OPEN state), the circuit breaker allows a single test request through to the primary path. If the test succeeds, the circuit breaker returns to CLOSED. If it fails, it returns to OPEN and resets the timeout.

For AI-augmented operations, circuit breakers should be implemented at multiple levels:

  • LLM API level: If the model provider’s API returns errors or timeouts, trip the circuit breaker and route to a backup provider or cached responses.
  • Tool execution level: If a tool the AI agent calls (database query, API call, file system operation) fails repeatedly, trip the circuit breaker for that specific tool and fall back to alternative resolution paths.
  • Quality level: If a quality check (confidence threshold, validation step, consistency check) fails repeatedly, trip the circuit breaker and escalate to human review rather than continuing to produce low-quality outputs.

The threshold parameters (failure count, time window, recovery timeout) must be tuned to the specific operational context. An IT service desk handling password resets can tolerate a more aggressive circuit breaker (trips after 3 failures, 30-second timeout) than a financial trading system (where even a single unexpected behavior might warrant investigation before resuming).

The Fallback Stack

Circuit breakers route to fallback paths, but what those fallback paths contain determines whether the system degrades gracefully or simply fails in a different way. A well-designed fallback stack provides multiple levels of degradation, each appropriate to a different failure severity:

Level | Trigger | Fallback Action | Example
L1 | Tool timeout or single tool failure | Use cached or default data | DNS lookup times out; use cached IP from last successful resolution
L2 | LLM API failure or provider outage | Route to backup LLM provider | Primary model unavailable; route to secondary provider with adapted prompt
L3 | Low confidence or quality check failure | Escalate to human reviewer | Model confidence below threshold; route ticket to human queue with AI-generated draft
L4 | Multiple simultaneous failures | Revert to rule-based automation | Both LLM providers unavailable; apply deterministic rule engine for common ticket types
L5 | Systemic failure or kill switch activation | Full manual operation | All automated systems offline; operators work from runbooks with no AI assistance

Each level must be tested regularly. A fallback path that has never been exercised is a fallback path that does not work. This is not theoretical — organizations routinely discover during actual incidents that their fallback systems have configuration drift, expired credentials, or incompatible data formats that prevent them from functioning when needed.
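The ladder's control flow is simple: attempt each level in order and fall through on failure, with full manual operation as the floor. This sketch is illustrative; the level names and handler shapes are assumptions, and a production version would log each fall-through and distinguish retryable from fatal errors rather than catching everything.

```python
def degrade(levels, request):
    """levels: ordered list of (name, handler); a handler raises on failure.

    Returns (level_name, result) from the first level that succeeds.
    """
    for name, handler in levels:
        try:
            return name, handler(request)
        except Exception:
            # Fall through to the next, more conservative level.
            continue
    # L5 (manual operation) should be the final handler and should not raise;
    # reaching this line means even the manual path was misconfigured.
    raise RuntimeError("all fallback levels exhausted")
```

Exercising this function with each handler forced to fail, one at a time, is exactly the regular fallback test the paragraph above calls for.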

The KILLSWITCH.md Specification

Given the stakes involved, kill switch design should not be left to ad hoc implementation decisions. The KILLSWITCH.md specification provides a structured template for documenting the kill switch architecture of any AI-augmented system. A compliant KILLSWITCH.md file includes:

Trigger definitions. What conditions warrant kill switch activation? These should include both automated triggers (e.g., error rate exceeds threshold for sustained period) and human-initiated triggers (e.g., operator observes behavior inconsistent with system design). Each trigger should specify the expected response time — the maximum acceptable interval between trigger condition and complete system halt.

Forbidden actions. What must the AI system never do, under any circumstances? This section defines hard boundaries that, if crossed, constitute automatic kill switch triggers regardless of other conditions. Examples include: modifying its own configuration, accessing systems outside its defined scope, taking actions that cannot be reversed, or providing recommendations in categories explicitly excluded from its mandate.

Three-level escalation. Not every concerning behavior warrants immediate full shutdown. The specification defines three escalation levels:

  • Level 1 — Restrict: Reduce the system’s autonomy. Disable autonomous action; switch to recommendation-only mode.
  • Level 2 — Pause: Halt all AI-generated outputs. Maintain system state for investigation. Route all work to human operators.
  • Level 3 — Terminate: Complete shutdown of the AI system. Disconnect from all data sources and action channels. Preserve all logs and state for forensic review.

Append-only audit logs. All kill switch events — activations, deactivations, escalation level changes, and the reasoning behind each — must be recorded in append-only logs that cannot be modified or deleted by any system component, including the AI system itself and its administrators. These logs serve as the evidentiary foundation for incident review, regulatory compliance, and trust repair.
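One way to make a log structurally append-only is hash chaining: each entry embeds the digest of its predecessor, so any retroactive edit breaks verification. A minimal sketch; a production system would additionally persist entries to tamper-resistant storage rather than process memory.

```python
import hashlib
import json


class AuditLog:
    """Append-only, hash-chained event log. Each entry's hash covers both
    the event and the previous entry's hash, so edits are detectable."""

    def __init__(self):
        self._entries = []

    def append(self, event: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else "genesis"
        payload = json.dumps({"event": event, "prev": prev}, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self._entries.append({"event": event, "prev": prev, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain from the start; False if any entry was altered."""
        prev = "genesis"
        for entry in self._entries:
            payload = json.dumps({"event": entry["event"], "prev": prev},
                                 sort_keys=True)
            if entry["prev"] != prev or \
               hashlib.sha256(payload.encode()).hexdigest() != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

Running `verify()` as part of incident review gives reviewers evidence that the record of kill switch activations has not been rewritten after the fact.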

The Swiss Cheese Model Applied to AI Operations

James Reason’s Swiss Cheese Model, originally developed for accident causation in aviation and healthcare, provides a useful framing for AI failure design. The model posits that safety depends on multiple defensive layers, each of which has holes (like slices of Swiss cheese). An accident occurs when the holes in multiple layers align, allowing a hazard to pass through all defenses.

Applied to AI operations, the defensive layers include the following. (These conceptual layers complement implementation-level defenses such as input sanitization, output validation, rate limiting, and audit logging, which operate at a more granular level of abstraction.)

  1. Model-level defenses: Training alignment, RLHF, system prompts, output filtering.
  2. Application-level defenses: RAG validation, confidence thresholds, chain-of-verification, multi-agent review.
  3. Interface-level defenses: Uncertainty expression, evidence presentation, friction for high-stakes actions.
  4. Operator-level defenses: Calibrated trust, domain expertise, override capability.
  5. Organizational-level defenses: Incident review processes, governance structures, regulatory compliance.
  6. Infrastructure-level defenses: Kill switches, circuit breakers, fallback stacks, audit logs.

No single layer is reliable on its own. Model-level defenses have known failure modes (jailbreaks, hallucination). Application-level defenses can be misconfigured. Interface-level defenses can be ignored by rushed operators. Operator-level defenses degrade with fatigue and complacency. Organizational-level defenses erode without active maintenance. Infrastructure-level defenses can have bugs.

The Swiss Cheese Model’s lesson is that safety comes from defense in depth — multiple independent layers, each designed to catch what the others miss. The most dangerous design decision is removing a layer because another layer “should” catch the problem.

Failure-Readiness Checklist

The following checklist provides a concrete assessment framework for evaluating whether an AI-augmented operation is adequately designed for failure:

| # | Item | Question | Pass Criteria |
|---|------|----------|---------------|
| 1 | Hallucination mitigation | Is there at least one verification layer between LLM output and action? | RAG validation, CoVe, multi-agent review, or equivalent implemented and tested |
| 2 | Confidence thresholds | Are low-confidence outputs routed differently than high-confidence outputs? | Threshold defined, calibrated against operational data, and enforced in code |
| 3 | Kill switch existence | Does a kill switch exist that can halt all automated actions? | Kill switch implemented, visible, tested within last 30 days |
| 4 | Kill switch independence | Is the kill switch external to the AI system? | AI system cannot access, modify, or reason about kill switch mechanism |
| 5 | Circuit breakers | Are circuit breakers implemented for all external dependencies? | Circuit breakers at LLM API, tool execution, and quality check levels |
| 6 | Fallback stack | Are fallback paths defined for at least 3 failure levels? | L1 through L3 minimum, tested within last 90 days |
| 7 | Uncertainty expression | Does the system communicate uncertainty to operators? | First-person uncertainty expression implemented for low-confidence outputs |
| 8 | Error self-reporting | Does the system surface its own errors? | Automated error detection with operator notification, not silent failure |
| 9 | Audit logging | Are all AI actions, recommendations, and operator decisions logged? | Append-only logs with correlation IDs, retained per compliance requirements |
| 10 | Failure drill cadence | Are failure scenarios regularly exercised? | Kill switch, circuit breaker, and fallback stack tested on documented schedule |

Failure as a Design Discipline

Designing for failure is not about expecting the worst. It is about ensuring that when the worst happens — and in any sufficiently complex system, it eventually will — the consequences are bounded, visible, and recoverable. The patterns in this chapter — hallucination mitigation stacks, kill switches, circuit breakers, fallback hierarchies, and the Swiss Cheese Model — are not overhead. They are the infrastructure that makes it safe to deploy AI agents in environments where their failures have real consequences.

The organizations that deploy AI most effectively will not be the ones whose systems never fail. They will be the ones whose systems fail well: visibly, containably, and in ways that preserve the operator’s ability to take control and make things right.

Chapter 8

Chapter 8: Organizational Governance

Technology design is necessary but not sufficient. Organizational governance determines whether good design survives contact with reality.

The previous chapters addressed how to design AI-human interaction patterns at the interface level: how information is presented, how autonomy is allocated, how trust is calibrated, how failures are contained. But every one of those design decisions exists within an organizational context that can either sustain it or erode it. A well-designed kill switch is useless if no one is authorized to activate it. A carefully calibrated confidence threshold drifts if no one reviews whether it still matches the model’s actual performance. An override mechanism atrophies if the organizational culture penalizes operators who use it. This chapter examines the governance structures, regulatory frameworks, and maturity models that determine whether AI-human interaction design survives deployment.

Policy Ownership: The Three-Lines Model

The most common governance failure in AI deployments is diffuse ownership. When no single person or team is accountable for an AI system’s behavior, everyone assumes someone else is watching. The three-lines model, adapted from risk management frameworks used in financial services, provides a clear structure:

First line: Application teams. The engineers and operators who build, deploy, and operate the AI system. They own the day-to-day decisions: prompt design, threshold tuning, incident response, performance monitoring. They are closest to the system and have the most detailed understanding of its behavior.

Second line: Risk and compliance functions. Teams that set standards, review designs, and monitor adherence. They do not build the system, but they define the guardrails within which the system must operate: acceptable risk levels, required documentation, mandatory testing, compliance with applicable regulations.

Third line: Independent audit. Internal or external auditors who periodically assess whether the first and second lines are functioning as intended. They provide assurance to leadership and, where applicable, to regulators that the governance framework is not merely documented but actually practiced.

Each AI system must have a named owner — not a team, not a committee, but an individual who is accountable for the system’s behavior and empowered to make decisions about it, including the decision to shut it down. This named owner typically sits in the first line but has defined escalation paths to the second and third lines.

The evidence for this structure extends beyond theory. Organizations with cross-functional AI governance teams — combining engineering, risk, legal, and domain expertise — deploy AI systems 40% faster than those with siloed governance, while experiencing 60% fewer compliance-related issues. The speed advantage is counterintuitive but consistent: clear governance reduces ambiguity, which reduces the cycle time of review-and-approve processes that otherwise bottleneck deployment.

The Galileo AI Agent Council Model

Governance structures must be operationalized through regular cadences, or they decay into documentation that no one reads. The Galileo AI “Agent Council” model provides a tested template:

Weekly triage (30 minutes). A standing meeting that reviews the past week’s AI system performance, including any incidents, near-misses, or anomalies. The agenda is structured: new incidents, ongoing investigations, metric trends, and upcoming changes. Decisions are recorded and assigned owners. The 30-minute timebox is deliberate — it forces prioritization and prevents governance from consuming the time needed for actual operations.

Monthly metrics briefings. A deeper review of performance data, trend analysis, and calibration assessment. This is where questions like “Is our confidence threshold still appropriate?” and “Are override rates changing in ways that suggest trust miscalibration?” are addressed with data. Attendees include first-line owners, second-line risk representatives, and relevant stakeholders.

Quarterly charter review. A strategic assessment of each AI system’s mandate, scope, and risk profile. This is where decisions about expanding or contracting autonomy levels are made, where new use cases are evaluated, and where the governance framework itself is updated based on lessons learned. The quarterly cadence ensures that governance evolves with the systems it governs.

AI Incident Review

When AI systems produce incorrect, harmful, or unexpected outputs, the organization’s response determines whether the failure becomes a learning opportunity or a repeated pattern. The AI incident review process extends the blameless post-mortem format — familiar from software engineering — with AI-specific elements.

Capture traces via correlation IDs. Every AI interaction should be traceable through its full lifecycle: the input that triggered it, the model’s reasoning (where available), the output produced, the operator’s response, and the ultimate outcome. Correlation IDs that link these elements are not optional — they are the evidentiary foundation of any meaningful review.
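A minimal sketch of correlation-ID propagation using Python's `contextvars`, so that every log call inside one interaction carries the same ID without threading it through every function signature. The stage names and the in-memory trace sink are illustrative stand-ins for a real logging pipeline.

```python
import uuid
from contextvars import ContextVar

# One correlation ID per AI interaction, carried implicitly through the call.
correlation_id: ContextVar[str] = ContextVar("correlation_id")
TRACE: list = []  # stand-in for a real log sink


def log(stage: str, **fields):
    TRACE.append({"correlation_id": correlation_id.get(),
                  "stage": stage, **fields})


def handle_interaction(alert: str) -> str:
    """Trace one full lifecycle: input, model output, operator response, outcome."""
    correlation_id.set(str(uuid.uuid4()))
    log("input", alert=alert)
    # Placeholder for the actual model call.
    recommendation = f"restart service mentioned in: {alert}"
    log("output", recommendation=recommendation)
    log("operator_decision", decision="accepted")
    log("outcome", resolved=True)
    return recommendation
```

Because every entry shares the ID, a reviewer can reconstruct the full chain for any single incident with one filter query.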

Review within 24–48 hours. Incident reviews that occur weeks after the event suffer from faded memories, rationalized narratives, and lost context. The 24–48 hour window balances thoroughness with freshness.

Categorize root cause. AI incidents have characteristic root cause categories that differ from traditional software failures:

  • Prompt failure: The system prompt, user prompt construction, or few-shot examples led the model to produce an inappropriate output.
  • Guardrail gap: The output violated a policy or constraint that should have been enforced but was not covered by existing guardrails.
  • Data quality: The knowledge base, retrieved documents, or input data contained errors, gaps, or outdated information that the model faithfully reproduced.
  • Permission scope: The AI system took an action it should not have been able to take, indicating an access control or capability boundary failure.
  • Emergent multi-agent behavior: In systems with multiple AI agents, the agents’ interactions produced behavior that none of them would have produced individually.
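These categories, together with the 24–48 hour review window, can be captured in a structured incident record. A sketch with hypothetical field names:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class RootCause(Enum):
    PROMPT_FAILURE = "prompt_failure"
    GUARDRAIL_GAP = "guardrail_gap"
    DATA_QUALITY = "data_quality"
    PERMISSION_SCOPE = "permission_scope"
    EMERGENT_MULTI_AGENT = "emergent_multi_agent"


@dataclass
class AIIncident:
    correlation_id: str           # links back to the full interaction trace
    summary: str
    occurred_at: datetime
    root_cause: Optional[RootCause] = None  # assigned during review

    def review_due_by(self) -> datetime:
        """Outer bound of the 24-48 hour review window."""
        return self.occurred_at + timedelta(hours=48)

    def review_overdue(self, now: datetime) -> bool:
        """True if no root cause was assigned within the review window."""
        return self.root_cause is None and now > self.review_due_by()
```

Making the root-cause field a closed enum forces each review to land on one of the characteristic categories, which is what makes trend analysis across incidents possible.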

The scale of this challenge is significant and growing. The AI Incident Database, which tracks publicly reported AI failures, contained more than 1,400 incidents as of early 2025, representing a 56.4% increase from 2023 to 2024. The acceleration is not solely because AI systems are getting worse — it is because more AI systems are being deployed in more contexts, and reporting is improving. But the trend underscores the need for systematic incident review rather than ad hoc responses.

Regulatory Frameworks

EU AI Act: Article 14

The European Union’s AI Act, with its provisions taking effect through August 2, 2026, establishes the most comprehensive regulatory framework for AI human oversight currently in force. Article 14 specifically addresses human oversight requirements for high-risk AI systems.

Article 14(4) specifies that human oversight measures shall enable the individuals exercising oversight to:

  • (a) Fully understand the capacities and limitations of the AI system and be able to monitor its operation.
  • (b) Remain aware of automation bias, particularly for systems used to provide information or recommendations for decisions by natural persons.
  • (c) Correctly interpret the AI system’s output, taking into account the characteristics of the system and the interpretation tools and methods available.
  • (d) Decide, in any particular situation, not to use the AI system or to disregard, override, or reverse the output.
  • (e) Intervene in the operation of the AI system or interrupt the system through a “stop” button or similar procedure.

The practical implications for GenAI engineers are direct: dashboards that make system behavior observable (a), automation bias training and countermeasures (b), uncertainty expression and evidence linking (c), override controls that are functional and not penalized (d), and kill switches (e) are not merely good design practices — they are, for high-risk systems operating in EU markets, legal requirements.

However, as legal scholar Melanie Fink has argued, human oversight alone is insufficient without system-level protections. An oversight requirement that places the entire burden on human operators — without requiring the system itself to be designed for safe failure — creates a regulatory gap. This critique reinforces the defense-in-depth approach described in Chapter 7: human oversight is one layer, not the entire safety architecture.

NIST AI Risk Management Framework

The National Institute of Standards and Technology (NIST) published the AI Risk Management Framework (AI RMF 1.0) to provide voluntary guidance for managing AI risks. The framework is organized around four core functions:

  • GOVERN: Establish and maintain the policies, processes, and accountability structures for AI risk management.
  • MAP: Identify and categorize the contexts, capabilities, and potential impacts of AI systems.
  • MEASURE: Assess and track AI risks using quantitative and qualitative methods.
  • MANAGE: Prioritize and act on identified risks through mitigation, monitoring, and communication.

NIST subsequently published the Generative AI Profile (NIST AI 600-1), which maps the specific risks of generative AI systems — including hallucination, confabulation, data privacy, and environmental impact — onto the AI RMF structure. For GenAI engineers, AI 600-1 provides a structured checklist of risks to assess and mitigate, organized by the same GOVERN-MAP-MEASURE-MANAGE taxonomy.

ISO/IEC 42001:2023

ISO/IEC 42001:2023 represents the first internationally certifiable management standard specifically for artificial intelligence. Modeled on the structure of ISO 27001 (information security) and ISO 9001 (quality management), it provides a framework for establishing, implementing, maintaining, and continually improving an AI management system within an organization.

For organizations operating across jurisdictions, ISO 42001 certification provides a demonstrable, auditable framework for AI governance that can satisfy multiple regulatory requirements simultaneously. The standard does not prescribe specific technical implementations but requires documented policies, risk assessments, and continuous improvement processes for AI systems.

The Gartner AI Maturity Model

Gartner’s AI Maturity Model provides a five-level framework for assessing an organization’s readiness to deploy and sustain AI systems:

| Level | Name | Characteristics |
|-------|------|-----------------|
| 1 | Awareness | AI explored in ad hoc pilots; no formal governance; individual enthusiasm drives adoption |
| 2 | Active | Multiple AI projects underway; some governance structures emerging; fragmented tooling and practices |
| 3 | Operational | AI systems in production with defined ownership; governance processes established; metrics tracked |
| 4 | Systemic | AI governance integrated into enterprise risk management; cross-functional coordination; reusable platforms |
| 5 | Transformational | AI embedded in core business processes; continuous learning loops; governance drives innovation rather than constraining it |

The maturity model is not merely descriptive — it is predictive. Research shows that only 20% of organizations at low maturity levels (1–2) keep their AI projects operational beyond three years, compared to 45% of organizations at high maturity levels (4–5). The gap is not primarily about technology quality; it is about governance sustainability. Low-maturity organizations launch AI projects with enthusiasm but lack the structures to maintain, monitor, and adapt them over time. The result is a pattern of pilot proliferation followed by quiet abandonment.

Key insight: The maturity model reveals a pattern that should concern every GenAI engineer: the governance infrastructure described in this chapter is not overhead that slows down deployment. It is the structural foundation that determines whether deployed systems remain operational long enough to deliver sustained value. Teams that skip governance to move faster are, statistically, building systems that will not survive their first year.

Cross-Domain Lessons

The challenge of governing human-AI interaction is not unique to GenAI. Several mature industries have spent decades developing governance frameworks for automated systems that humans must oversee. Their convergent findings are instructive.

Aviation pioneered systematic incident reporting with NASA’s Aviation Safety Reporting System (ASRS), which provides confidential, non-punitive reporting of safety concerns. The National Transportation Safety Board (NTSB) conducts independent accident investigations that produce binding safety recommendations. The aviation industry’s safety record — commercial aviation fatality rates have declined by orders of magnitude over decades — is attributable not to any single technology but to the governance ecosystem around it: mandatory reporting, independent investigation, continuous training, and a culture where challenging automated systems is expected rather than penalized.

Healthcare has developed specific regulatory frameworks for clinical decision support (CDS) through the FDA. The agency’s guidance distinguishes between CDS that is intended to replace clinical judgment (regulated as a medical device) and CDS that is intended to support it (potentially exempt under the “Draft & Refine” exemption, where a clinician reviews and modifies the output before it reaches the patient). This distinction maps directly onto the autonomy levels discussed in earlier chapters: the regulatory framework recognizes that the governance requirements depend on how much human involvement the system is designed to include.

Financial services provides perhaps the most directly applicable precedent through MiFID II and its implementing regulation RTS 6, which governs algorithmic trading. The requirements include: pre-trade controls that prevent orders outside defined parameters, real-time monitoring of all algorithmic activity, kill switches capable of immediately canceling all outstanding orders, and annual self-assessment of the algorithmic trading systems. These requirements emerged directly from incidents like Knight Capital and codify the design patterns discussed in Chapter 6 into regulatory mandates.

Key insight: Every mature domain that has integrated automated decision-making into high-stakes operations has independently converged on the same core principles: mandatory human oversight capability, independent incident investigation, systematic reporting, kill switch requirements, and governance structures that are audited rather than merely documented. GenAI operations are not exempt from these principles — they are the newest domain to encounter them.

AI-Human Interaction Maturity Model

Synthesizing the governance frameworks, regulatory requirements, and cross-domain lessons discussed in this chapter, the following maturity model provides a self-assessment framework for AI-human interaction governance:

| Level | Governance | Incident Management | Regulatory Posture | Trust Calibration | Failure Design |
|-------|------------|---------------------|--------------------|-------------------|----------------|
| 1 — Ad Hoc | No formal ownership; AI systems deployed by individual teams | No structured review; failures handled reactively | Unaware of applicable requirements | No systematic measurement | Kill switch absent or untested |
| 2 — Emerging | Named owners for major systems; informal governance | Incident reports filed but not systematically reviewed | Requirements identified but not yet addressed | Basic accuracy metrics tracked | Kill switch exists; fallback stack partial |
| 3 — Defined | Three-lines model implemented; regular governance cadence | Blameless post-mortems with root cause categorization | Compliance plan documented and in progress | Override rates and confidence calibration tracked | Circuit breakers and full fallback stack tested |
| 4 — Managed | Cross-functional AI council; governance integrated with enterprise risk | AI Incident Database contributions; trend analysis drives improvements | Certified or independently audited against applicable standards | Behavioral trust metrics drive design iteration | Swiss Cheese Model applied; failure drills on schedule |
| 5 — Optimizing | Governance drives innovation; continuous improvement loops | Predictive incident analytics; near-miss program operational | Active participation in standards development | Trust calibration is a continuous, measured process | Failure design is a core competency, not an afterthought |

An organization need not reach Level 5 to deploy AI systems responsibly. But an organization at Level 1 deploying autonomous AI agents in production is operating with governance debt that will compound over time — and the research suggests that compound interest on governance debt is steep.

From Design to Durability

The governance structures described in this chapter are the connective tissue between design intent and operational reality. Without them, the interaction patterns of earlier chapters are aspirational documentation. With them, those patterns become living systems that adapt to changing models, changing regulations, changing operators, and changing operational contexts. The technology will continue to evolve rapidly. The governance question is whether the organization can evolve with it.

Chapter 9

Chapter 9: Conclusion

The hardest part of deploying AI in operations is not the AI. It is the seam.

The seam is the boundary where an AI system’s output meets a human’s judgment. It is where a recommendation becomes a decision, where a draft becomes an action, where a prediction becomes a commitment. Every failure examined in this booklet — every case of automation bias, every ignored alert, every catastrophic loss — occurred at this seam. And every successful deployment, every case where AI genuinely amplified human capability, succeeded because someone designed that seam with care.

Three Principles

The patterns, frameworks, and case studies presented across these chapters converge on three principles. They are not novel. They are, in many ways, obvious. But the evidence shows that they are violated more often than they are observed.

1. Design the seam, don’t eliminate it

The human-AI boundary is not an inconvenience to be minimized. It is the critical control surface of the entire system. Every effort to make the boundary invisible — to make the AI’s output flow seamlessly into action without friction, review, or human judgment — removes the mechanism by which errors are caught, edge cases are recognized, and the system adapts to contexts it was not designed for.

This does not mean that every AI action requires human approval. The autonomy levels and escalation frameworks discussed in earlier chapters provide a spectrum from full human control to monitored autonomy. But at every level, the seam must be designed: the human must know what the AI did, why it did it, and how to intervene if something is wrong. The seam must be visible, navigable, and functional — not a vestigial checkbox in a workflow that operators learn to skip.

2. Support the human’s cognition, don’t replace it

The value of AI in operations is not that it thinks so the human doesn’t have to. It is that it processes, retrieves, and structures information so the human can think better. The distinction matters because the failure mode of the first framing is complacency — the human disengages, loses situational awareness, and becomes unable to catch the errors that the AI will inevitably make. The failure mode of the second framing is merely inefficiency, which is a problem of a fundamentally different severity.

The interaction patterns that support human cognition — progressive disclosure with SBAR-structured briefs, categorical confidence with calibration data, evidence linking that supports Recognition-Primed Decision-making — all share a common design philosophy. They amplify the human’s pattern recognition, intuition, and contextual reasoning rather than bypassing it. They present information in formats that align with how expert operators actually think, rather than in formats that are convenient for the AI to produce.

3. Build for failure, not just success

Every AI agent will, at some point, produce incorrect recommendations, fabricate information, take inappropriate actions, or behave in ways its designers did not anticipate. This is not a temporary limitation that will be solved by the next model release. It is a structural characteristic of systems that operate in open-ended, real-world environments with incomplete information and evolving contexts.

The implication is that failure design is not a secondary concern to be addressed after the core system works. It is part of the core system. Every autonomous action needs an external kill switch that the AI cannot circumvent. Every automated workflow needs a tested fallback that operators have practiced. Every AI system needs a named human owner who is empowered and authorized to shut it down.

The Evidence, Synthesized

The research and case studies presented across these chapters tell a consistent story about what happens when these principles are violated:

  • Automation bias produces commission error rates approaching 100% in laboratory settings — operators follow demonstrably incorrect AI recommendations because the act of questioning the system requires more cognitive effort than accepting its output.
  • Alert fatigue leaves 63% of security alerts unaddressed, not because operators are negligent but because the volume of alerts exceeds human processing capacity and the interface design does not support effective triage.
  • Complacency drift can go undetected for extended periods — in one documented case, 34 hours of automated system misbehavior passed without human detection, because the monitoring interfaces were not designed to surface gradual degradation.
  • Knight Capital’s 45-minute, $440 million loss occurred while 97 automated error emails went unread, because no one was assigned to monitor them, no threshold triggered an escalation, and no kill switch existed to halt automated trading.
  • Boeing’s 737 MAX MCAS system, relying on a single angle-of-attack sensor with an override procedure that was neither obvious nor adequately trained, contributed to 346 deaths across two crashes.

These are not failures of AI technology. They are failures of seam design. In every case, the technical system was doing what it was built to do. The failure was in the boundary between the system and the humans who were supposed to oversee it.

The Emerging Standard

Across the frameworks examined in this booklet, an emerging standard for AI-human interaction design is taking shape. The CSA six-level autonomy framework — from full human control through monitored autonomy to full automation — with dynamic downshifting based on context, confidence, and consequence represents the structural foundation. The key innovation is not the levels themselves but the principle of dynamic movement between them: a system that operates at Level 4 autonomy for routine tasks but automatically downshifts to Level 2 when confidence drops or stakes rise.
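A minimal sketch of that downshifting rule, assuming a numeric six-level scale (higher means more autonomy) and illustrative thresholds that would in practice be calibrated against operational data:

```python
def effective_autonomy(default_level: int, confidence: float,
                       high_stakes: bool) -> int:
    """Downshift from the default autonomy level when confidence drops
    or consequence rises. Thresholds are illustrative, not prescriptive."""
    level = default_level
    if confidence < 0.9:
        level = min(level, 3)  # uncertain: require human approval
    if confidence < 0.7:
        level = min(level, 2)  # very uncertain: recommendation-only
    if high_stakes:
        level = min(level, 2)  # irreversible or high-consequence: recommendation-only
    return level
```

The important property is that the function only ever shifts downward from the configured default; upshifting back to higher autonomy remains a deliberate, governed decision rather than an automatic one.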

The information architecture of the seam is equally critical. Progressive disclosure with SBAR-structured briefs ensures that operators receive the right information at the right time. Categorical confidence with calibration data ensures that uncertainty is communicated in actionable terms. Evidence linking supports the operator’s own reasoning process rather than demanding blind trust.

The failure architecture — kill switches external to the AI, circuit breakers at every dependency, fallback stacks tested on schedule, and the Swiss Cheese Model’s defense-in-depth philosophy — ensures that when failures occur, they are bounded, visible, and recoverable.

And the governance architecture — named owners, three-lines accountability, regular cadence reviews, blameless incident post-mortems, and alignment with regulatory frameworks like the EU AI Act and NIST AI RMF — ensures that all of the above persists beyond the initial deployment.

What to Do Monday Morning

For the GenAI engineer reading this on a Sunday evening, wondering where to start, here are five concrete steps (Chapter 6 provides the templates and worksheets to execute them):

  1. Audit one existing AI-human interaction. Pick a single point in your current system where an AI output reaches a human operator. Map it: What information does the operator receive? What can they do with it? How would they know if it was wrong? How would they stop it?

  2. Apply the pattern selection matrix. For that interaction, determine the appropriate autonomy level based on consequence severity, decision reversibility, time constraints, and AI confidence. Is the current level appropriate? If not, what would need to change?

  3. Add a kill switch. If your AI system can take autonomous actions and does not have a mechanism for immediately halting all such actions — one that is external to the AI, always visible, and requires no confirmation dialog — build one. Test it. Document it.

  4. Measure override rates. Start tracking how often operators accept, modify, or reject AI recommendations, stratified by the AI’s reported confidence level. This single metric will tell you more about trust calibration than any survey or interview.

  5. Schedule an AI interaction review. Put a recurring 30-minute meeting on the calendar — weekly or biweekly — to review AI system performance, incidents, and near-misses. Invite engineering, operations, and at least one person from outside the immediate team. Follow the blameless post-mortem format. Do this before you need to.
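Step 4 can start as something very small. A sketch of an override-rate tracker stratified by confidence band, with hypothetical band and decision labels:

```python
from collections import Counter


class OverrideTracker:
    """Tally operator decisions per reported confidence band and compute
    the fraction of recommendations that were modified or rejected."""

    DECISIONS = {"accepted", "modified", "rejected"}

    def __init__(self):
        self.counts = Counter()  # (band, decision) -> count

    def record(self, confidence_band: str, decision: str):
        if decision not in self.DECISIONS:
            raise ValueError(f"unknown decision: {decision}")
        self.counts[(confidence_band, decision)] += 1

    def override_rate(self, band: str) -> float:
        total = sum(v for (b, _), v in self.counts.items() if b == band)
        overridden = sum(v for (b, d), v in self.counts.items()
                         if b == band and d != "accepted")
        return overridden / total if total else 0.0
```

A high override rate in the high-confidence band suggests the model is overconfident; a near-zero override rate everywhere suggests operators may have stopped scrutinizing outputs at all. Both are calibration signals no survey will surface.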

None of these steps requires new technology, new budget, or organizational approval. They require attention, intention, and the recognition that the seam between AI and human is the most important design surface in your system.

Closing

The organizations that get this right will not be the ones with the most sophisticated AI. They will be the ones with the most thoughtfully designed seams.

The models will continue to improve. Context windows will grow. Reasoning capabilities will deepen. Costs will fall. But the fundamental challenge — ensuring that a probabilistic system and a human operator collaborate effectively under uncertainty, time pressure, and real-world consequence — will remain. It is a design problem, a governance problem, and ultimately a human problem. The patterns in this booklet are a starting point, not a destination. The destination is operations where AI makes human experts more capable, more informed, and more effective — without ever making them less vigilant.