Read the full case study on our Case Studies page. This is the engineering-side post.
The stack
- Twilio WhatsApp Business as the channel
- Laravel as the orchestration backbone
- pgvector on Postgres for RAG (we tried Pinecone first, but it was overkill for our scale)
- gpt-4o-mini for triage, gpt-4o for full replies
- A small Vue dashboard for the support team to take over conversations
What worked
The intent-classification step in front of the main reply path. We classify every inbound message into one of ~30 intents before deciding whether to answer in full, ask a clarifying question, or escalate. This made the system feel faster and more accurate than a single big prompt.
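A minimal sketch of that triage step, written in Python for brevity rather than in the Laravel code the system runs on. The intent names, the 0.6 confidence cut-off, and the keyword-based `classify` stub are all assumptions for illustration; in the real system the classification comes from the gpt-4o-mini call.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    CLARIFY = "clarify"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Triage:
    intent: str
    confidence: float  # 0.0 to 1.0, as reported by the classifier


# Hypothetical intents that always go to a human, whatever the confidence.
ESCALATE_INTENTS = {"refund_dispute", "legal_complaint"}
# Hypothetical cut-off below which we ask the customer to clarify.
CLARIFY_BELOW = 0.6


def classify(message: str) -> Triage:
    """Stand-in for the small-model call; keyword rules keep the sketch runnable."""
    text = message.lower()
    if "refund" in text:
        return Triage("refund_dispute", 0.9)
    if "where" in text and "order" in text:
        return Triage("order_status", 0.85)
    return Triage("unknown", 0.3)


def route(triage: Triage) -> Action:
    """Decide what the main reply path should do with a classified message."""
    if triage.intent in ESCALATE_INTENTS:
        return Action.ESCALATE
    if triage.confidence < CLARIFY_BELOW:
        return Action.CLARIFY
    return Action.ANSWER
```

With these stand-in rules, `route(classify("Where is my order?"))` returns `Action.ANSWER`, a refund message escalates, and an unrecognised message gets a clarifying question. Keeping the routing decision separate from the classifier means the cheap model only has to pick a label, and the policy for each label stays in plain code.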
The "draft for human" mode that the team can toggle when they want eyes on every reply. About 12% of the team's time is now spent in that mode, mostly during sales weekends.
What we'd do differently
We over-built the eval suite at the start. A simpler "every 20th reply gets reviewed by a senior agent, scored 1–5" loop would have caught the same drift earlier and cost less to maintain.
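That simpler loop could look something like the sketch below, again in Python for illustration. The class name, the rolling window of 10 scores, and the alert threshold of 3.5 are assumptions, not values from the project.

```python
from collections import deque
from statistics import mean


class SampledReview:
    """Flag every Nth reply for human scoring and watch the rolling average."""

    def __init__(self, every: int = 20, window: int = 10, alert_below: float = 3.5):
        self.every = every
        self.alert_below = alert_below
        self.sent = 0
        self.scores = deque(maxlen=window)  # most recent review scores only

    def needs_review(self) -> bool:
        """Call once per outgoing reply; True on every Nth one."""
        self.sent += 1
        return self.sent % self.every == 0

    def record(self, score: int) -> None:
        """Store a senior agent's 1 to 5 score for a sampled reply."""
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        self.scores.append(score)

    def drifting(self) -> bool:
        """True once the window is full and its mean falls below the threshold."""
        full = len(self.scores) == self.scores.maxlen
        return full and mean(self.scores) < self.alert_below
```

Waiting for a full window before raising an alert is a deliberate choice in this sketch: a single low score early on should not page anyone.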
We under-invested in the take-over UX. The team had to refresh the dashboard constantly for a week before we added live updates via Pusher. Listen to the support team early.