You point an agent at your issue tracker. It reads the stack traces, checks the timeline, and writes you a report. The report is articulate, well-structured, and wrong. It confused two unrelated errors that happened near the same timestamp. It attributed 53 events to a single bug without checking whether the error messages actually matched. It flagged a model name as "invalid" because the model was released after its training cutoff.
Frontier models are good at reading stack traces, parsing logs, and summarising what they find. They're also good at constructing confident-sounding narratives from coincidence.
We've been building a triage agent at brand.ai that synthesises across every observability signal simultaneously. Errors, sessions, analytics, LLM traces, releases, pull requests. Not reacting to one alert at a time, but correlating all of them. It runs as a Claude Code skill, either locally or on CI. No custom platform.
The Single-Source Trap
The obvious approach to agent-driven triage is monitor-driven: something fires, and an agent investigates that specific thing. One signal, one investigation. This works well for issues with a clean signal: a stack trace, an error-rate spike, a latency regression.
But production issues don't always announce themselves through a single channel. A user's AI generation fails silently (LLM trace shows a latency spike), they rage-click three times (session analytics), a related exception surfaces in the issue tracker five minutes later (different service entirely), and a PR merged that morning touched the same code path (release history).
No single monitor catches that. The pattern only emerges when you look across everything at once.
The Full Stack
Our triage agent sees seven categories of signal: error tracking (stack traces, exception metadata); session analytics (user journeys, feature usage); UX signals (rage clicks, dead clicks, web vitals); LLM traces (every AI generation event with model, cost, latency, tokens, and error breakdowns); session replays (network requests with bodies, status codes, and durations); releases and PRs (deploy timestamps, new error groups, recently merged code); and distributed logs from sandboxed AI execution environments.
OpenTelemetry ties them together. OTel spans and logs cross service boundaries, linking a frontend error to an API call to an LLM generation to a sandbox execution. Without that instrumentation layer, the seven sources are disconnected dashboards. With it, the agent can follow a single request through the entire stack.
Each source answers different questions. Your issue tracker shows what threw. Your analytics show who was affected. Your LLM traces show whether the AI pipeline is healthy. Your release history shows what changed. But OTel is what lets the agent correlate across them.
Cluster First, Investigate Second
The naive approach is to fetch everything and hand it to one big investigation prompt. This fails for two reasons: context windows aren't infinite, and the agent drowns in noise before it finds signal.
Our system runs parallel fetch agents, one per data source. Each agent writes its results to files and returns only a manifest: issue IDs, event counts, file paths. Nothing gets loaded into a single context window. The orchestrator sees pointers, not payloads.
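The manifest pattern can be sketched in a few lines. Everything here is illustrative: the function name, the JSON file layout, and the `issue_id` field are assumptions, not the production schema.

```python
import json
from pathlib import Path


def fetch_and_manifest(source: str, events: list[dict], out_dir: Path) -> dict:
    """Write a source's raw events to disk; return only a lightweight manifest.

    The orchestrator reads manifests (pointers), never the payloads
    themselves, so no single context window has to hold the raw data.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{source}.json"
    path.write_text(json.dumps(events))
    return {
        "source": source,
        "event_count": len(events),
        "issue_ids": sorted({e["issue_id"] for e in events if "issue_id" in e}),
        "path": str(path),  # pointer a deep-dive agent can follow later
    }
```

A deep-dive agent that actually needs the payload loads it from `path` itself; the orchestrator only ever routes on counts and IDs.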
Before clustering, the orchestrator scans previous triage reports for overlap. If an issue showed up in yesterday's scan, the agent knows. It can flag it as persistent rather than treating it as new, and it won't waste deep-dive tokens re-investigating something that's already been analysed.
Then it clusters.
Three layers of clustering, each progressively less deterministic:
Fingerprint grouping. Pure pattern matching. Stack trace fingerprints, exception type plus URL pattern, model plus error type. No intelligence required. Just deduplication.
Temporal and contextual. Events within a five-minute window that share a user ID, URL pattern, or trace ID get grouped. An error and an analytics session from the same user within the same minute aren't two incidents. They're one incident observed from two angles.
LLM hypothesis grouping. The agent generates a one-line root cause hypothesis per cluster and proposes merges for similar ones. Every merge at this layer is marked speculative. The default is to under-group. A false merge (two unrelated issues combined) is worse than a missed merge, because it can mask the real cause of either.
Only after clustering does the system dispatch deep-dive investigations, and only on the top clusters ranked by severity.
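The two deterministic layers can be sketched as follows. The field names (`stack_fingerprint`, `exception_type`, `url_pattern`, `user_id`, `ts`) are assumptions for illustration; only the five-minute window comes from the description above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # temporal tolerance from the clustering rules


def fingerprint(event: dict) -> str:
    # Layer 1 key: stack fingerprint if present, else exception type + URL.
    return event.get("stack_fingerprint") or (
        f'{event.get("exception_type")}|{event.get("url_pattern")}'
    )


def cluster(events: list[dict]) -> list[list[dict]]:
    # Layer 1: pure deduplication by fingerprint.
    by_fp: dict[str, list[dict]] = defaultdict(list)
    for e in events:
        by_fp[fingerprint(e)].append(e)
    # Layer 2: merge fingerprint clusters that share a user ID within the
    # window -- one incident observed from two angles.
    merged: list[list[dict]] = []
    for c in sorted(by_fp.values(), key=lambda c: min(e["ts"] for e in c)):
        for m in merged:
            if any(
                a.get("user_id") and a.get("user_id") == b.get("user_id")
                and abs(a["ts"] - b["ts"]) <= WINDOW
                for a in m for b in c
            ):
                m.extend(c)
                break
        else:
            merged.append(c)
    return merged
```

The third, LLM-driven layer sits on top of this output and only ever proposes merges between these clusters, which is what keeps its speculation contained.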
The Join Problem
Your issue tracker and your analytics platform don't share a primary key. Users have different identifiers in each system. Events land at slightly different timestamps depending on client-side vs server-side collection.
The agent handles this by following breadcrumbs. User IDs from issue tracker tags map to distinct IDs in analytics queries. Some events carry cross-tool references directly. Timestamps within a tolerance window, combined with matching user IDs, create probabilistic links.
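A probabilistic link check might look like this sketch. The `id_map` translating issue-tracker user tags to analytics distinct IDs, the field names, and the two-minute tolerance are all assumptions, not a real API.

```python
from datetime import timedelta

TOLERANCE = timedelta(minutes=2)  # assumed tolerance; the real window is tunable


def probable_link(error: dict, session: dict, id_map: dict) -> bool:
    """Cross-source link: mapped user IDs plus timestamps within tolerance.

    Returns True only when both conditions hold -- a probabilistic join,
    not proof that the two events describe the same incident.
    """
    mapped = id_map.get(error.get("user_tag"))
    if mapped is None or mapped != session.get("distinct_id"):
        return False
    return abs(error["ts"] - session["ts"]) <= TOLERANCE
```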
When one source lacks the data, check the next. Analytics events first, then issue tracker details, then session replay snapshots, then the codebase itself. No ETL pipeline, no unified data lake. The agent stitches sources together at investigation time.
The Fabrication Problem
Giving an agent access to seven data sources doesn't make it a better investigator. It makes it a more creative storyteller.
An agent that finds an error, a user session near the same timestamp, and a recently merged PR touching the same file will weave a causal narrative. Except the PR might have been merged twelve hours earlier. The session might be a different user on a different page. Three unrelated facts, stitched together with vibes.
We've codified a set of anti-patterns that constrain how the agent reasons about evidence. These aren't prompt suggestions. They're the architecture. The system doesn't work without them.
The gap-to-narrative pipeline. A single unverified assumption ("this model name is invalid") becomes the lens through which all subsequent evidence is interpreted, producing a coherent but fabricated narrative. The antidote: after forming a hypothesis, actively search for evidence that contradicts it before searching for evidence that supports it. If the only evidence for your hypothesis is "I don't recognise this value," you have no evidence.
Fingerprint trust kills accuracy quietly. Your issue tracker groups events by stack trace and exception class, not by root cause. Wrapper exceptions can contain completely different underlying errors under the same issue ID. "53 events" is not "53 instances of the same bug" until you sample the individual events and verify the error messages match.
Write-path blindness sends investigations down the wrong branch. When you find unexpected data (wrong IDs, corrupted metadata, values that shouldn't exist), the instinct is to investigate where the data is consumed. The bug is almost always in how the data was created. "Where did this value come from?" matters more than "why does this value cause an error?"
Training-data confabulation is unique to agent investigators. The agent encounters an identifier, model name, or config value it doesn't recognise, and its first instinct is to assume it's invalid. This instinct is wrong. The agent's training data has a cutoff; the codebase is live. An unrecognised value in deployed, reviewed code deserves the presumption of validity. Corroborate from the codebase before calling anything invalid.
Correlation without causation is the classic. Verify shared trace IDs and timeline order before claiming causal links. Temporal proximity is not evidence.
Mechanism without root cause is the trap that sounds like progress. A race condition is how something broke. The missing guard is why. "What went wrong" isn't the same as "what allowed it to go wrong."
Guarding against these anti-patterns enforces a broader investigative methodology that agents don't follow by default.
Root cause progression. Every investigation follows the same chain: symptom (what's observed), mechanism (how it broke), enabling condition (what allowed it to break), root cause (why the enabling condition existed). Most agents stop at mechanism. "There's a race condition" is mechanism. "The retry logic was removed in a refactor three weeks ago" is root cause.
Falsification-first reasoning. After forming a hypothesis, the agent's next move isn't to find supporting evidence. It's to find contradicting evidence. If the hypothesis survives attempts to break it, confidence goes up. If it doesn't, you've saved yourself from shipping a wrong report.
Evidence hierarchy. Every finding gets a confidence tag. "Confirmed" requires direct proof: a trace ID linking cause to effect, a reproducing test case. "Probable" means strong circumstantial evidence. "Speculative" means consistent with the data but not proven. "Unverifiable" means the data to confirm or deny doesn't exist. The report carries these tags through to the reader.
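The hierarchy is simple enough to encode directly, which keeps the tags from being decorative. A minimal sketch (the type names and `render` format are illustrative):

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    CONFIRMED = "confirmed"        # direct proof: trace ID link, reproducing test
    PROBABLE = "probable"          # strong circumstantial evidence
    SPECULATIVE = "speculative"    # consistent with the data, not proven
    UNVERIFIABLE = "unverifiable"  # the data to confirm or deny doesn't exist


@dataclass
class Finding:
    claim: str
    confidence: Confidence
    evidence: list[str]

    def render(self) -> str:
        # The tag is carried through to the report, so readers see it.
        return f"[{self.confidence.value}] {self.claim}"
```

Making the tag a required field means no finding can reach the report without an explicit confidence judgement attached.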
Left to their defaults, agents summarise what they see. They don't prove what they claim. The methodology has to be explicit.
The Report
The output is a fully rendered HTML report. Clusters ranked by severity, each tagged as new, escalating, or persistent. Linked issue IDs that click through to your issue tracker, deploy correlation showing which releases introduced new error groups, deep-dive findings with the evidence chain laid out, and a suggested owner based on recent commit and PR history for the affected code paths.
Two example findings from one run:

APICallError. Resolved by PR #2341 (input sanitiser). Zero events after Mar 18.

StreamException. File size is fetched via a HEAD request but only logged; no guard to skip. Ongoing risk.

Persistent issues get flagged across runs. A reader can scan the report, see what's already been investigated, and focus on what's new. They can click an issue ID, read the stack trace themselves, and check the agent's reasoning.
A full scan takes about ten minutes and costs around $4 in model usage.
The Takeaway
No human on-call is checking seven sources, clustering related events, correlating deploys to error spikes, and sampling individual events to verify they're actually the same bug. They check one or two tools, form a hypothesis, and go with it. The agent does every pass, every time, across everything.
