Evals, Security & Trust
You can now ship AI features that pass a security review - and run a prompt injection on your own bot to find the holes.
3-hour live session · 90 min theory · 90 min hands-on · plus one exercise
The 3-hour live session
Theory grounds you. Hands-on lands it. Both happen in the same Saturday session.
Theory
The session that separates 'I built a demo' from 'I shipped to production'. How to know your AI is still working three weeks after launch, when usage has dropped and you can't tell why. How to talk to security teams about prompt injection. How to measure what an AI feature actually does - because eval scores aren't the same as production metrics. And what the Air Canada chatbot case teaches every PM shipping AI in 2026.
Hands-on
Wire observability onto what you built. Run a prompt injection on your own bot - and feel the chill when it works. Build the guardrail. By the end you'll know how to ship AI you'd actually let your CEO demo without sweating.
Your exercise for the week
Add observability to your Project 2 voice agent. Run 10 calls. Find one case where the output went wrong. Bring the trace.
The moment that lands
A real production chatbot architecture from a Cohort 1 participant's company: every time the PM wants to update the prompt, they go to Langfuse and update it. The engineering team's code fetches the latest version. The PM owns the prompt. The engineer owns the infrastructure. That is the correct architecture.
What I'll keep saying
- “You can't blame the model. You're expected to ship a good output. The eval is yours.”
- “In traditional software you can trace errors. You can't see the model's thoughts. That's why Langfuse exists.”
- “100% accuracy you do not expect. 90-95-98% is the realistic ceiling. Build the fallback.”
Want this week in your real life?
DM me on WhatsApp or book a 30-minute call. The cohort is 12 to 18 PMs and I personally vet every applicant.