Building a multilingual AI booking bot: 6 months of production lessons

AI · Node.js · Production

We built the AI sales bot for Tepe Watersports in early 2024. The spec was clear: handle WhatsApp inquiries in Turkish, English, Russian, and Arabic, quote prices accurately, negotiate if needed, and close bookings. The architecture looked solid in testing. Then we put it in front of real customers. What follows are the problems that actually appeared in production: the ones we had not anticipated.

Language detection fails on real input

Our first approach was tinyld — a lightweight Node.js library for language detection. Fast, no API calls, zero cost. It worked for clean single-language text. It didn't work for anything else.

"Merhabalar", the common Turkish greeting "merhaba" with the plural suffix -lar added to soften it, was classified as Portuguese. Mixed-language messages confused it completely. Short messages of one or two words failed unpredictably.

The fix: replace the library with a Haiku classifier. A small system prompt, one API call, a language code returned. The cost per classification is negligible. The accuracy is substantially better, because a language model understands context — slang, suffixes, mixed-language phrases — that a statistical n-gram detector doesn't.
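A minimal sketch of this kind of classifier, for illustration only and not the production code. The prompt wording, the `callModel` wrapper (a hypothetical async function around whatever LLM client is in use), and the English fallback are all assumptions. The point it shows is that the model's reply is checked against a fixed list of codes before anything else trusts it.

```javascript
// Languages the bot supports, as ISO 639-1 codes.
const SUPPORTED = ["tr", "en", "ru", "ar"];

// Assumed prompt wording; the real system prompt is not shown in this article.
const CLASSIFIER_PROMPT =
  "Identify the language of the user's message. " +
  "Reply with exactly one of: tr, en, ru, ar. No other text.";

// callModel is a hypothetical wrapper: async (systemPrompt, userText) => string.
async function detectLanguage(text, callModel, fallback = "en") {
  try {
    const raw = await callModel(CLASSIFIER_PROMPT, text);
    const code = String(raw).trim().toLowerCase();
    // Accept only known codes; anything else falls back.
    return SUPPORTED.includes(code) ? code : fallback;
  } catch (err) {
    // API or network failure: degrade to the fallback instead of crashing.
    return fallback;
  }
}
```

Injecting `callModel` keeps the function testable without network access, and the allow-list means an unexpected reply such as a full sentence can never leak into downstream logic as a language code.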

Lesson: for multilingual user input from real customers who don't write in textbook sentences, a small LLM classifier is more reliable than a statistical library. The extra 200–400ms is worth the reduction in misclassification.

Pricing context must live in code, not the prompt

Early versions had pricing information in the system prompt. The model hallucinated numbers. Not often — maybe 2–3% of conversations. But 2–3% of conversations where the bot quotes a nonexistent price is a real problem. Customers screenshot things.

The fix: remove pricing from the prompt entirely. All price lookups go through a deterministic function that queries the live price list from the WordPress plugin via REST API. The bot receives the result of that function call — specific, current numbers — not a general description of pricing structure. There's nothing to hallucinate.
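As a rough sketch of the idea, assuming the price list has already been fetched from the REST endpoint as an array of rows: the row shape (`activity`, `location`, `unitPriceEur`), the function name, and the numbers in the usage note are hypothetical. The lookup is plain code, so it either returns a real row or reports that there is none.

```javascript
// Look up a price deterministically from an already-fetched price list.
// Assumed row shape: { activity: string, location: string, unitPriceEur: number }
function quotePrice(priceList, activity, location, units = 1) {
  const row = priceList.find(
    (p) => p.activity === activity && p.location === location
  );
  if (!row) {
    // No guessing: a missing price is reported, never invented.
    return { ok: false, reason: "no_price_for_activity_and_location" };
  }
  return {
    ok: true,
    activity,
    location,
    units,
    unitPriceEur: row.unitPriceEur,
    totalEur: row.unitPriceEur * units,
  };
}
```

With a made-up list such as `[{ activity: "jet-ski", location: "beach-a", unitPriceEur: 80 }]`, calling `quotePrice(list, "jet-ski", "beach-a", 2)` yields a total of 160, and any unknown combination returns `ok: false`. The model is handed only this result object to phrase, never the raw pricing rules.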

Lesson: any value that needs to be exact — prices, dates, counts, availability — should come from a deterministic function. Don't ask the model to remember numbers from the prompt.

The bot will confidently quote a price without knowing the location

This is the failure mode that hits hardest, because it looks correct.

The bot's job is to confirm the customer's location before quoting a price. Location determines which activity package applies, which determines the price. If the bot skips this step (because of a confused conversation flow, a distracted user, or a prompt instruction it followed only in part), it quotes a price that may not apply to the customer's actual situation.

We documented this internally as BUG24. The root cause: the bot followed the instruction 'give prices when asked' while missing the guard condition 'only after confirming location.'

The fix: a validation layer that runs after every AI response. If the response contains a price (detected by the € symbol) and the session has no confirmed location, the response is discarded and the model is re-prompted with an explicit instruction to ask for location first. The customer never sees the bad response.
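The guard can be sketched roughly as follows. The € check and the discard-and-re-prompt behavior come from the description above; the `generate` wrapper, the `confirmedLocation` session field, the retry cap, and the canned fallback message are assumptions added for illustration.

```javascript
const PRICE_MARKER = "€";

// True when a reply quotes a price but the session has no confirmed location.
function violatesLocationGuard(reply, session) {
  return reply.includes(PRICE_MARKER) && !session.confirmedLocation;
}

// generate is a hypothetical wrapper: async (userMessage, extraInstruction) => string.
async function safeReply(generate, session, userMessage, maxRetries = 2) {
  let reply = await generate(userMessage, null);
  for (let i = 0; i < maxRetries && violatesLocationGuard(reply, session); i++) {
    // Discard the bad reply and re-prompt with an explicit instruction.
    reply = await generate(
      userMessage,
      "Do not quote any price. Ask the customer to confirm their location first."
    );
  }
  if (violatesLocationGuard(reply, session)) {
    // Assumed last resort: a fixed question instead of an unvalidated price.
    return "Which location would you like to book at? I can give an exact price once I know.";
  }
  return reply;
}
```

Capping the retries avoids an unbounded loop if the model keeps quoting a price. In a multilingual bot, the fallback text would need a version per supported language.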

Lesson: for high-stakes output — a price, a booking confirmation, a date — add a deterministic validation layer after the model response. Don't rely on the prompt instruction alone.

Memory disappears at 20 messages

We had MAX_HISTORY=20 in the config, a reasonable limit to keep context window costs down. What we didn't account for: a customer who asks general questions over several days and then comes back to book. By the time they're ready to commit, the earlier messages have been dropped and the bot no longer remembers what was discussed.

The fix: remove the hard limit entirely and add prompt caching (Claude's built-in feature for caching a repeated prompt prefix, such as a long system prompt). Context window costs stay manageable without arbitrarily erasing conversation history. The full history is preserved; the expensive part (the system prompt) is cached.
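A hedged sketch of what such a request body might look like. The `cache_control` field on a system text block follows the Anthropic Messages API as documented at the time of writing and should be checked against the current docs. The function name, the `max_tokens` value, and the model parameter are assumptions.

```javascript
// Build a Messages API request body with the system prompt marked as cacheable.
// history: array of { role: "user" | "assistant", content: string }, kept in full.
function buildRequest(systemPrompt, history, model) {
  return {
    model, // supplied by the caller; no specific model is assumed here
    max_tokens: 1024, // assumed reply cap
    system: [
      {
        type: "text",
        text: systemPrompt,
        // Marks the prompt prefix up to this block as cacheable across requests.
        cache_control: { type: "ephemeral" },
      },
    ],
    // Full history, no MAX_HISTORY slice.
    messages: history.map(({ role, content }) => ({ role, content })),
  };
}
```

The context window is still finite, so a token count per conversation remains useful as a safety net. The difference is that the budget is measured in tokens, not in an arbitrary number of messages.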

Lesson: don't implement arbitrary message limits. Use prompt caching and token counting to manage costs, not conversation truncation.

Truncation logic cuts off the thing that matters most

A validate() function cut AI responses down to 3 paragraphs, on the theory that long responses are bad UX for WhatsApp. The problem: a response explaining an activity plus a price table ran to 4 paragraphs. The 4th paragraph, the one with the prices, was silently removed. The customer received a message that explained the activity but left out the one thing they had asked for.

This happened three times before we identified the pattern from customer follow-up messages.

The fix: if the response contains a € symbol, skip the length check entirely.
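In code, the exemption amounts to a guard at the top of the length check. This is a sketch and not the original validate() function: the name `limitParagraphs` and the blank-line paragraph split are assumptions.

```javascript
// Trim a reply to maxParagraphs, unless it contains a price.
function limitParagraphs(text, maxParagraphs = 3) {
  // Price-bearing replies are exempt: the price is usually the last paragraph.
  if (text.includes("€")) return text;

  // Assumed paragraph delimiter: one or more blank lines.
  const paragraphs = text.split(/\n\s*\n/);
  if (paragraphs.length <= maxParagraphs) return text;
  return paragraphs.slice(0, maxParagraphs).join("\n\n");
}
```

A four-paragraph reply without a price is still cut to three, while the same reply with a € figure in the last paragraph passes through untouched.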

Lesson: truncation logic is dangerous when content has structure. Always check what you're cutting off, not just whether the output is long.

The failure modes you actually ship

There were 24+ documented bugs over 6 months of production, ranging from Cyrillic location names breaking geolookup, to campaigns not applying correctly to multi-unit bookings, to admin notifications going missing when the hosting platform restarted mid-send.

The consistent pattern: the failures are at the seams — between the AI layer and the data layer, between the bot and the admin system, between what the model understands and what the code assumes it understood.

Testing catches the paths you anticipate. Production finds the ones you didn't.

The bot now handles the majority of customer conversations end-to-end. The team manages exceptions, not volume. Getting there took 6 months of finding every assumption we'd made and testing it against reality.
