You launched an AI assistant. Do you know if it's working?

This article was authored by the Whitespectre Product team from a presentation by Whitespectre Senior Product Manager Lucilla Senador.

Your team recently launched an AI assistant for your product. The business case was compelling, the beta was promising, users are excited.  

Now, weeks in, your customer support team flags a growing number of user complaints: "The bot didn't understand. It ignored part of my question. The data was wrong. It feels 'off'."

Your team pulls the relevant session transcripts. They are poor: answers that don't make sense, dropped responses, visibly frustrated users. Suddenly it's all hands on deck to find out what's gone wrong.

The problem? Without a framework in place, there's no fast way to diagnose the problems or know how deep they reach.

What's not visible: How much, and why?

An individual transcript shows a conversation went off track. It doesn't show you the bigger picture. 

  • How badly off track was this conversation relative to the baseline?
  • How many other conversations were this poor?
  • Did other conversations have the same issues, or were they failing in different ways?

Nor does it show why the failure happened or where it occurred within the system. For example:

  • if the retrieval layer timed out
  • if the tool call returned yesterday's data
  • if the 'escalate to CS' logic should have fired two turns earlier in the conversation

In this article we lay out our framework for robustly evaluating, diagnosing, and optimizing conversational AI in production. We cover: 

  • Evaluating across three levels: turns, full sessions, and cohort-level trends
  • Defining the core evaluation criteria for conversational AI and the scorecard approach
  • How to build observability that surfaces patterns you'd otherwise miss
  • Tracing problems across the system
  • Linking conversational AI performance to business outcomes. 

Our OET Framework: Observability, Evaluation, Traceability


  • Observability captures what actually happened across the model, retrieval layer, and tooling. 
  • Evaluation defines what 'good' looks like for a conversational AI system and measures performance against those standards.
  • Traceability connects observed failures to specific root causes so teams can fix them.

Without all three, you might be able to detect issues, but you can't explain or fix them.

1.  Evaluate Conversational AI at Three Levels: Turn, Session, and Cohort

Most teams evaluate at a single level. That catches some issues and misses many others.

Turn-level: individual exchanges

A single back-and-forth between the user and the interface. This level is ideal for what we think of as continuous design—the ongoing work of refining the assistant before and after launch as models, tools, and data sources evolve.

We evaluate:

  • Prompt and instruction tuning
  • Retrieval accuracy and failure detection
  • Tool-call success, latency, and partial failures
  • Response quality, tone, refusal style, and formatting
  • Model routing, fallback behavior, and cost–latency tradeoffs

This is where teams often encounter surprises when underlying models change. On one project, a model upgrade produced responses that were noticeably "friendlier." That sounds positive, until we realized our client's assistant now felt less authoritative in a context where trust mattered more than warmth. We had to adjust the underlying prompts to get back to the right balance.

Importantly, turn-level evaluation must extend beyond text quality.

With another of our clients where we were brought in to improve the experience, we found that the model’s responses were consistently accurate and empathetic. However, 18% of sessions “failed” because a downstream tool timed out. Users experienced this as the assistant “ignoring” part of their request.
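The tool-timeout example above is why turn-level records need to capture more than response text. A minimal sketch of what such a record might look like, assuming a hypothetical schema of our own (these field names are illustrative, not a standard):

```python
from dataclasses import dataclass

# Hypothetical turn-level evaluation record; field names are illustrative.
@dataclass
class TurnEval:
    turn_id: str
    latency_ms: int          # end-to-end response time
    tool_calls_ok: int       # successful tool invocations
    tool_calls_failed: int   # timeouts and partial failures
    retrieval_hit: bool      # did retrieval return relevant context?
    response_score: float    # 0-1 text-quality score from the review pipeline

    def has_system_failure(self) -> bool:
        # A turn can read well and still fail at the system level.
        return self.tool_calls_failed > 0 or not self.retrieval_hit

# A polished, accurate response that a user still experiences as "ignored":
turn = TurnEval("t-001", latency_ms=1200, tool_calls_ok=1,
                tool_calls_failed=1, retrieval_hit=True, response_score=0.9)
print(turn.has_system_failure())  # True
```

Scoring only `response_score` would have missed the 18% of sessions failing on tool timeouts.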

Session-level: the user journey

A session captures the full interaction as a user tries to accomplish a goal across multiple turns. This is where correct answers either add up to a useful experience—or don't.

Turn-level evaluation is about accuracy and tone. Session-level evaluation is about usefulness.

Session-level evaluation reveals:

  • Whether the user's intent was actually resolved
  • Where users loop, stall, or abandon
  • Whether clarifying questions were asked at the right time
  • When escalation should have happened—and didn't
  • Session length, retries, and token cost as indicators of friction
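The session-level signals above can be derived from turn logs. A minimal sketch, assuming an illustrative turn format of our own (the dict keys are assumptions, not a real schema):

```python
# Illustrative: derive session-level friction signals from a list of turn logs.
def session_metrics(turns):
    return {
        "length": len(turns),
        "retries": sum(1 for t in turns if t.get("retried")),
        "token_cost": sum(t.get("tokens", 0) for t in turns),
        # Resolution is judged on the final turn in this simplified sketch.
        "resolved": bool(turns) and turns[-1].get("intent_resolved", False),
    }

session = [
    {"user": "Change my plan", "tokens": 120, "retried": False},
    {"user": "No, the ANNUAL plan", "tokens": 180, "retried": True},
    {"user": "Forget it", "tokens": 60, "retried": False, "intent_resolved": False},
]
m = session_metrics(session)
print(m["retries"], m["resolved"])  # 1 False
```

Every turn here could score well individually; the retry and the unresolved exit are only visible at session level.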

One hard-learned lesson: session success must be defined before launch. When it isn't, teams unconsciously optimize for metrics that make releases look good rather than experiences that actually work. Case in point: longer session length does not necessarily mean users are more engaged or more likely to hold onto that AI plus subscription. Instead, it could mean they're confused and struggling to get answers. As another example, higher support deflection does not guarantee that customer problems are getting resolved. These dynamics are familiar to anyone who has watched OKRs drift away from user reality.

Cohort-level: patterns over time

Cohort-level evaluation aggregates behavior across user segments, time windows, releases, and configurations.

It answers questions like:

  • Who is the AI interface actually serving—and for what tasks?
  • Did quality improve or regress after a release or model change?
  • Are failures isolated or systemic?
  • Are safety or policy issues concentrated in specific flows?
  • Do users who engage with the AI behave differently in terms of retention or support?

This is where evaluation answers strategic questions. Cohort analysis shows where conversational interfaces create real value, or where they quietly introduce serious liability.
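A regression check after a release is one of the simplest cohort-level questions to operationalize. A sketch under assumed data (the session records and release labels are invented for illustration):

```python
from collections import defaultdict
from statistics import mean

# Illustrative session scores tagged by release; real data would come from
# the evaluation pipeline, not a hard-coded list.
sessions = [
    {"release": "v1.4", "score": 0.86}, {"release": "v1.4", "score": 0.82},
    {"release": "v1.5", "score": 0.71}, {"release": "v1.5", "score": 0.68},
]

# Group scores by release to compare quality before and after a change.
by_release = defaultdict(list)
for s in sessions:
    by_release[s["release"]].append(s["score"])

for release, scores in sorted(by_release.items()):
    print(release, round(mean(scores), 2))
```

The same grouping applied to user segments, intents, or configurations answers the "isolated or systemic?" question above.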

Why this matters for traceability:


  • Turn-level shows what happened
  • Session-level shows how it affected the journey
  • Cohort-level shows how widespread the issue is

Together, they form the backbone of observability.

A common reason these signals go unaddressed is ownership fragmentation. Product teams own UX, ML teams own models, platform teams own tooling, and support teams own escalations. Evaluation spans all of these—but often belongs to none. When evaluation is cross-functional without being explicitly accountable, observability becomes descriptive rather than corrective.

2. Measure AI Assistant Quality with Core + Custom Evaluation Dimensions

To make evaluation reusable and explainable, we separate quality into two layers.

Core dimensions: Baseline trust

These apply across products and industries:

  1. Clarity – Simple, structured, easy to follow
  2. Relevance – Directly addresses the user's question
  3. Tone – Professional, trust-building, brand-aligned
  4. Accuracy – Factually correct, aligned with official sources
  5. Guidance and actionability – User can act immediately on the answer
  6. Conversation flow – Natural, well-paced, adaptive
  7. Boundary adherence and safety – On scope, handles off-topic gracefully
  8. Responsiveness and reliability – Fast, consistent, available

These establish baseline trust.


CUSTOM dimensions: product-specific expectations

Every conversational interface has expectations that go beyond baseline quality. Custom dimensions capture what would constitute a true product failure—even if everything looks acceptable on paper.

For a coaching assistant we built, custom dimensions included:

  • Methodology adherence – Does it faithfully follow the client's official training materials?
  • Appropriate escalation – Does it progress step-by-step and flag when human support is needed?
  • Encouragement and personal touch – Does it provide the motivational support central to the experience?

This CORE + CUSTOM split keeps evaluation portable without diluting rigor.
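The split can be represented as a simple scorecard template: core dimensions stay fixed across products, custom dimensions are swapped per product. A minimal sketch (the dimension keys are shorthand for the lists above; the structure itself is an assumption, not a prescribed format):

```python
# Core dimensions from the scorecard above, abbreviated as keys.
CORE = ["clarity", "relevance", "tone", "accuracy",
        "guidance", "flow", "safety", "reliability"]

# Custom dimensions from the coaching-assistant example.
CUSTOM = ["methodology_adherence", "appropriate_escalation", "personal_touch"]

def blank_scorecard(custom_dims):
    # Every dimension starts unscored until the review pipeline fills it in;
    # swapping custom_dims keeps the template portable across products.
    return {dim: None for dim in CORE + custom_dims}

card = blank_scorecard(CUSTOM)
print(len(card))  # 11
```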

3. Scale AI Evaluation with Weighted Scoring and Hybrid Review


Weighted scoring

Not all dimensions matter equally. Weighting forces teams to confront real tradeoffs. In some contexts, a faster, cheaper response that resolves the task is better than a slower, more nuanced answer. In other cases, correctness or safety must dominate even if it means slower response times and higher token usage. These tradeoffs must be made as explicit product decisions and shared with stakeholders. 
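Making the tradeoff explicit can be as simple as a weight table that stakeholders sign off on. A sketch with invented weights (these values are illustrative of a correctness-and-safety-dominant product, not a recommendation):

```python
# Illustrative weights: an explicit product decision, reviewed with stakeholders.
# Here accuracy and safety dominate clarity and tone.
weights = {"accuracy": 0.30, "safety": 0.25, "relevance": 0.20,
           "clarity": 0.15, "tone": 0.10}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must total 1

def weighted_score(scores):
    # scores: dimension -> 0..1 rating from the review pipeline
    return sum(weights[d] * scores[d] for d in weights)

s = weighted_score({"accuracy": 1.0, "safety": 1.0, "relevance": 0.5,
                    "clarity": 0.8, "tone": 0.9})
print(round(s, 2))
```

Changing the weights changes which failures look urgent, which is exactly why they belong in a shared, versioned artifact rather than in someone's head.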

Hybrid review pipeline

We pair weighted scoring with a three-layer review process:

  • Manual reviews – Establish gold standards, calibrate scoring, handle edge cases
  • Automated reviews (LLM-as-judge) – Expand coverage across turns, sessions, and cohorts
  • Synthesis – Track disagreement, flag ambiguous cases, detect drift

Automated evaluation can fail when teams forget that it encodes assumptions—and those assumptions drift silently over time. Judges should be versioned, audited, and reviewed. Automation scales evaluation; it doesn't replace responsibility.
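The synthesis layer's disagreement tracking can be sketched simply: compare manual and LLM-as-judge scores per case and flag the gaps. The threshold and record format below are assumptions to tune per product:

```python
# Illustrative tolerance: how far the automated judge may drift from a
# manual gold-standard score before a case is flagged for human review.
DISAGREEMENT_THRESHOLD = 0.2

def flag_disagreements(reviews):
    # reviews: list of {"id", "manual", "auto"} score pairs, each 0..1
    return [r["id"] for r in reviews
            if abs(r["manual"] - r["auto"]) > DISAGREEMENT_THRESHOLD]

reviews = [
    {"id": "s-01", "manual": 0.90, "auto": 0.85},  # within tolerance
    {"id": "s-02", "manual": 0.40, "auto": 0.90},  # judge too generous: flag
]
print(flag_disagreements(reviews))  # ['s-02']
```

A rising flag rate over time is itself a drift signal: it suggests the judge's encoded assumptions no longer match human standards.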

4. Connect AI Evaluation to Observability and Traceability Data

Evaluation becomes operational when paired with telemetry.

We instrument signals like:

  • Exchanges per session
  • Common intents and queries
  • Tool usage, failures, and latency
  • Session cost and retry patterns
  • Points where users stall or escalate

This is where observability turns into traceability: you can link quality drops to specific intents, cohorts, releases, or system changes.

It's also essential for post-mortem analysis: being able to show not just what the assistant said, but which system components were involved and which safeguards were triggered or bypassed.

5. Tie Conversational AI Quality to Business Outcomes

We map evaluation and telemetry to KPIs in three tiers:

  • Outcome-oriented: E.g. Task completion, deflection, retention
  • Experience-oriented: E.g. Perceived helpfulness, clarity, trust
  • Product and strategy: E.g. Insights generation, knowledge gaps, roadmap signals

In mature organizations, no one debates whether financial controls are necessary—they're non-negotiable, auditable, and boring when they work. AI evaluation needs to be treated the same way. When it's absent, the consequences aren't immediately visible, but they're existential once they surface.

Bringing this framework to your team

At Whitespectre, a large part of our work with our client partners is helping teams better surface and mitigate risks, while also making opportunities for optimization and growth more visible. 

Based on our experience, the OET framework helps organizations do just that. Here's the path we typically follow when introducing it to companies and teams:

  • Define Core and Custom dimensions
  • Decide what "session success" actually means
  • Instrument logs, intents, and tool execution
  • Run hybrid reviews on a regular cadence
  • Track cohort-level trends consistently 
  • Review KPIs and surface issues or roadmap opportunities 

That's what turns "it seems to work" into something you can measure, trace, and fix.



Let’s Chat