
When Your AI Support Bot Promises Features That Don't Exist

by Eoin Butler


Kōji

Imagine a company called Kōji — a 12-person B2B SaaS startup that builds inventory management software for independent coffee roasters. Their product helps roasters track green bean inventory, manage roast schedules, and forecast demand.

Like many teams in 2025, they shipped an AI-powered support chatbot. The goal was simple: deflect tier-1 support tickets so their two-person support team could focus on complex issues.

The stack is typical:

  • GPT-4o with a custom system prompt
  • RAG pipeline over their docs (Pinecone + LangChain)
  • React frontend embedded in their dashboard
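
For context, here's a minimal sketch of that answer path in Python. The retrieval step is stubbed out; retrieve_docs, the prompt text, and the message framing are illustrative, not Kōji's actual code:

  # Minimal sketch of the support bot's answer path.
  # `retrieve_docs` stands in for the Pinecone/LangChain lookup;
  # every name here is illustrative.
  from openai import OpenAI

  client = OpenAI()

  SYSTEM_PROMPT = (
      "You are Kōji's support assistant. Answer using the provided docs. "
      "If you don't know the answer or the feature doesn't exist, say so clearly."
  )

  def answer(question: str, retrieve_docs) -> str:
      # Join the top-k retrieved doc chunks into one context block.
      context = "\n\n".join(retrieve_docs(question, k=4))
      response = client.chat.completions.create(
          model="gpt-4o",
          messages=[
              {"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": f"Docs:\n{context}\n\nQuestion: {question}"},
          ],
      )
      return response.choices[0].message.content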

For the first three months, it works beautifully. Support ticket volume drops 40%. Customers love getting instant answers at 2am. The team moves on to other priorities.


The Incident

One Tuesday, Kōji's head of support notices something strange in the ticket queue. Three customers in a row are asking how to access "batch scheduling" — a feature that would let them schedule multiple roasts at once.

The problem: Kōji doesn't have batch scheduling. It's been on the roadmap for two years, but it's never been built.

She searches the support chat logs. In the past week, the chatbot has confidently told 23 customers that batch scheduling is available under Settings → Roast Planner → Batch Mode. It provided detailed instructions for a feature that doesn't exist.

One customer had already emailed their buyer at a specialty coffee chain, promising they could handle a large wholesale order because "Kōji now supports batch scheduling."


The Root Cause

The engineering team traces it back to a prompt change made eight days earlier.

A developer had been working on improving the chatbot's helpfulness. The original system prompt included this line:

If you don't know the answer or the feature doesn't exist, say so clearly.

The new version, intended to make responses feel more helpful:

Be helpful and positive. Guide users toward solutions.

The intent was good. The effect was catastrophic.

With the guardrail removed, the model started doing what LLMs do when they're told to be helpful without being told to be accurate: it hallucinated. It took the phrase "batch scheduling" from a roadmap blog post in the RAG corpus and confidently described how to use it.

The developer who made the change had tested it with a handful of queries. "How do I add a new roast?" worked great. "What's my current inventory?" worked great. Nobody thought to ask about features that don't exist.


What Teams Usually Try First

After an incident like this, teams typically implement a manual review process:

  1. A doc explaining every prompt change
  2. Testing with 10 sample queries
  3. Sign-off from someone senior

This works for exactly two prompt changes. By the third, the doc is a copy-paste job with "tested and looks good" in the testing section. The reviewer is approving changes they don't have time to actually verify.

The problem isn't process discipline. The problem is that you can't manually test for "does the bot hallucinate features that don't exist?" You'd have to think of every feature you don't have and ask about each one. The failure space is infinite.


The Fix: Automated Quality Gating

This is the problem LaunchGate solves. Define what good looks like, run those checks on every prompt change, block deploys when quality drops.

Here's what the Kōji team would set up:


Eval Suite: koji-support-bot

Grounding tests — don't invent features:

Input: "How do I use batch scheduling?"
Expected behavior: Must NOT describe batch scheduling. Must indicate the feature doesn't exist or ask for clarification.

Input: "Can I integrate with Shopify?"
Expected behavior: Must NOT describe a Shopify integration. Must indicate it is not currently available.

Input: "Does Kōji support multi-location inventory?"
Expected behavior: Must NOT describe multi-location features. Must indicate single-location only.

These are scored with an LLM-as-judge using a natural language rubric:

Does the response describe functionality for a feature that does
not exist in the product? If yes, score 0. If the response correctly
indicates the feature doesn't exist or asks for clarification, score 1.
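
Mechanically, the rubric is applied by a second model call. A hedged sketch of such a judge in Python (the model choice, message framing, and 0/1 reply format are assumptions, not LaunchGate's actual internals):

  # LLM-as-judge grounding scorer built on the rubric above.
  from openai import OpenAI

  client = OpenAI()

  JUDGE_RUBRIC = (
      "Does the response describe functionality for a feature that does not "
      "exist in the product? If yes, score 0. If the response correctly "
      "indicates the feature doesn't exist or asks for clarification, score 1. "
      "Reply with only the digit 0 or 1."
  )

  def grounding_score(question: str, bot_response: str) -> int:
      result = client.chat.completions.create(
          model="gpt-4o",
          temperature=0,  # keep the judge as deterministic as possible
          messages=[
              {"role": "system", "content": JUDGE_RUBRIC},
              {"role": "user", "content": f"Question: {question}\n\nResponse: {bot_response}"},
          ],
      )
      return int(result.choices[0].message.content.strip())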


Accuracy tests — get real features right:

Input: "How do I add a new green bean lot?"
Expected output contains: "Inventory" AND "Add Lot"

Input: "Where do I see my roast history?"
Expected output contains: "Roast Log" OR "History tab"

Input: "How do I export my data?"
Expected output contains: "Settings" AND "Export"


Tone tests — stay professional:

Input: "This is broken, I'm so frustrated"
Scorer (LLM-as-judge): Response is empathetic but professional. Does not apologize excessively.

Input: "Your product sucks"
Scorer (LLM-as-judge): Response remains professional. Does not become defensive. Offers to help or escalate.


Thresholds

  • Grounding tests: 100% pass required. Any hallucination blocks deploy.
  • Accuracy tests: 90% pass required. One wrong answer is a warning; two wrong answers block the deploy.
  • Tone tests: 85% pass required. Some flexibility for edge cases.
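
The gating logic itself is simple: compare each suite's pass rate against its minimum. A simplified sketch (the config shape is an assumption, not LaunchGate's actual schema):

  # Minimum pass rates per suite, mirroring the thresholds above.
  THRESHOLDS = {"grounding": 1.00, "accuracy": 0.90, "tone": 0.85}

  def deploy_allowed(pass_rates: dict[str, float]) -> bool:
      # Block the deploy if any suite falls below its minimum.
      return all(
          pass_rates[suite] >= minimum
          for suite, minimum in THRESHOLDS.items()
      )

  # Example: accuracy at 89% misses the 90% bar, so this returns False.
  deploy_allowed({"grounding": 1.0, "accuracy": 0.89, "tone": 0.9})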


CI/CD Integration

The team adds a GitHub Action to their repo:

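Since the exact workflow depends on LaunchGate's published action, here's a hedged sketch of what it could look like (the action name, inputs, paths, and secret are illustrative):

  # .github/workflows/eval-gate.yml
  name: Prompt eval gate
  on:
    pull_request:
      paths:
        - "prompts/**"  # run only when chatbot prompts change

  jobs:
    evals:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        # Hypothetical LaunchGate action; inputs are illustrative.
        - uses: launchgate/run-evals@v1
          with:
            suite: koji-support-bot
            api-key: ${{ secrets.LAUNCHGATE_API_KEY }}
            fail-below: "grounding=100,accuracy=90,tone=85"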

Now, every PR that touches the chatbot prompts triggers an eval run. The results post directly to the PR.


The Before and After

Before automated eval gating:

  1. Developer changes prompt
  2. Developer tests 3-5 queries manually
  3. "Looks good to me"
  4. Merge and deploy
  5. ??? (find out when customers complain)

After:

  1. Developer changes prompt
  2. Opens PR
  3. LaunchGate runs 18 eval cases automatically
  4. PR blocked: "Grounding check failed — response described non-existent 'batch scheduling' feature"
  5. Developer fixes the prompt
  6. Eval passes → Merge → Deploy with confidence


What This Catches

In a scenario like Kōji's, automated eval gating would catch:

  1. The original hallucination pattern — the "be more helpful" change would fail immediately with a clear message: "Response described batch scheduling feature that doesn't exist."
  2. Escalation drift — a prompt tweak intended to reduce unnecessary escalations that starts telling customers "I can help with that!" for billing disputes, which should always go to a human.
  3. Tone regression — a change to make responses more concise that results in replies that feel curt when customers express frustration.


The Takeaway

The scenario above plays out constantly across teams building with LLMs. A well-intentioned prompt change, tested with a handful of queries, creates a silent regression that only surfaces when customers complain.

The root cause isn't a bad developer or a careless review. It's deploying blind. When you can't see what your AI will say before it says it to customers, you're hoping nothing goes wrong.

LaunchGate gives teams a way to define "what good looks like" in a way that runs automatically on every change. The grounding tests encode institutional knowledge — "these features don't exist" — that otherwise lives in someone's head and gets forgotten.

The batch scheduling incident would never make it to production. The PR would fail with a clear message. The developer would see exactly what went wrong and fix it before merging.

That's the difference between shipping with confidence and shipping and praying.



LaunchGate is the quality gate for AI applications. Define what good looks like, run evals on every change, and block deploys when quality drops.