20 May 2026 12:15 - 12:45
From data to deployment: Evaluating enterprise AI in production
Enterprise AI systems built on large language models are rapidly evolving beyond static question-answering into workflows involving retrieval, tool use, and multi-step agentic behavior. Reliably deploying these systems remains challenging, as performance depends heavily on data quality, context construction, and how systems behave in production.
This session presents a practical framework for building reliable enterprise AI systems using operational data signals and structured evaluation. By decomposing quality into retrieval relevance, generation faithfulness, and agent-level behavior, rather than collapsing it into a single metric, teams can turn operational data into a foundation for continuous improvement.
The session draws on recent advances in evaluating RAG systems and agentic workflows to illustrate both the promise and limitations of automated evaluation pipelines. It also explores how production signals can be systematically fed back through targeted test sets and human calibration to keep systems reliable as models and data change over time.