Engineering Better Evals: Scalable LLM Evaluation Pipelines That Work — Dat Ngo, Aman Khan, Arize

As LLM-powered products become more sophisticated, the need for scalable, reliable evaluation pipelines has never been more critical. This session dives deep into advanced LLM evaluation strategies that move beyond toy benchmarks toward real-world production impact. We'll explore how to architect and implement evaluation pipelines that work across both online and offline environments, reducing dev complexity and accelerating iteration.

The session will cover:
- LLM-as-a-judge frameworks (a minimal illustrative sketch appears at the end of this description)
- Human-in-the-loop evaluation
- How hybrid approaches unlock more robust and nuanced performance assessments

We'll break down technical architectures, share real implementation patterns, and examine trade-offs between evaluation techniques to help engineers make informed choices. Whether you're building from scratch or refining existing workflows, this talk offers practical strategies for crafting efficient, scalable, and accurate eval pipelines tailored to custom LLM products.

About Dat Ngo

I'm Dat Ngo, Director of AI Solutions at Arize, where I work with the world's largest companies to build and optimize AI applications for their business. With nearly a decade of experience in the AI space, I specialize in helping organizations tackle their biggest challenges around AI evaluation, observability, and making AI systems work reliably at scale.

At Arize, we partner with industry leaders including Reddit, Booking.com, Siemens, Roblox, and hundreds of other companies to solve the most complex problems in AI deployment and monitoring. This gives me unique insight into what it really takes to build production AI systems that deliver business value.

My passion for AI extends beyond the office: I eat, live, and breathe AI. I'm deeply engaged with the AI community through speaking, learning, and connecting with fellow practitioners who are pushing the boundaries of what's possible with artificial intelligence. As a speaker, I bring real-world expertise from the trenches of enterprise AI deployment, sharing practical insights on evaluation frameworks, observability strategies, and the operational realities of making AI work at scale.

About Aman Khan

Aman is Director of Product, LLM at Arize AI. Prior to Arize, Aman was the PM for the Jukebox Feature Store on the ML Platform team at Spotify, serving ~50 data science teams. Aman was also PM for ML evaluation frameworks across data science and engineering teams for self-driving cars at Cruise, which helped launch the first self-driving car service in an urban environment. Aman studied Mechanical Engineering at UC Berkeley and lived in the SF Bay Area for 9 years before moving to NYC.

Recorded at the AI Engineer World's Fair in San Francisco.

Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter
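
To make the LLM-as-a-judge bullet concrete, here is a minimal, hypothetical sketch of the pattern using the OpenAI Python client. The rubric, judge model, 1-5 scale, and function names are illustrative assumptions for this description, not the speakers' implementation or Arize's API.

# Hypothetical LLM-as-a-judge sketch (illustrative only, not from the talk).
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment;
# the rubric, model choice, and 1-5 scale are assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}

Rate the answer's factual correctness from 1 (wrong) to 5 (fully correct).
Respond with only the integer score."""

def judge(question: str, answer: str) -> int:
    """Score one (question, answer) pair with an LLM judge."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # keep grading as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())

# Offline usage: score a small batch of logged responses.
if __name__ == "__main__":
    print(judge("What is 2 + 2?", "4"))

The same judge function can run offline over logged traces or online as a sampled check on live traffic, which is the online/offline split the abstract refers to.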