Generative Benchmarking: Measuring AI Models Beyond Accuracy [Kelly Hong] - 728

In this episode, Kelly Hong, a researcher at Chroma, joins us to discuss "Generative Benchmarking," a novel approach to evaluating retrieval systems, like RAG applications, using synthetic data. Kelly explains how traditional benchmarks like MTEB fail to represent real-world query patterns and how embedding models that perform well on public benchmarks often underperform in production. The conversation explores the two-step process of Generative Benchmarking: filtering documents to focus on relevant content and generating queries that mimic actual user behavior. Kelly shares insights from applying this approach to Weights & Biases' technical support bot, revealing how domain-specific evaluation provides more accurate assessments of embedding model performance. We also discuss the importance of aligning LLM judges with human preferences, the impact of chunking strategies on retrieval effectiveness, and how production queries differ from benchmark queries in ambiguity and style. Throughout the episode, Kelly emphasizes the need for systematic evaluation approaches that go beyond "vibe checks" to help developers build more effective RAG applications. (A minimal code sketch of the two-step idea appears at the end of these notes.)

🎧 / 🎥 Listen or watch the full episode on our page: https://twimlai.com/go/728

🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1

🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Follow us on LinkedIn: https://www.linkedin.com/company/twimlai/
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/

📖 CHAPTERS
===============================
00:00 - Introduction
2:32 - Origin of the project
5:44 - Evaluation process
8:32 - Generative benchmarking process
15:32 - Distinction in user queries with public benchmarks
19:02 - Evaluating user queries
20:24 - Impact of embedding models and chunking strategies on retrieval performance
22:41 - Metrics
24:01 - Alignment and LLM as judge
26:15 - Evalgen
28:36 - Data labeling process
34:44 - Future directions
37:19 - Considerations of information retrieval
39:21 - Distractors
43:29 - Naive query generation
46:16 - Representativeness
47:56 - Misconception about generative benchmarking

🔗 LINKS & RESOURCES
===============================
Generative Benchmarking - https://research.trychroma.com/generative-benchmarking

📸 Camera: https://amzn.to/3TQ3zsg
🎙️ Microphone: https://amzn.to/3t5zXeV
🚦 Lights: https://amzn.to/3TQlX49
🎛️ Audio Interface: https://amzn.to/3TVFAIq
🎚️ Stream Deck: https://amzn.to/3zzm7F5
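
The sketch below is a minimal, self-contained illustration of the two-step loop described in the episode summary: filter document chunks down to relevant content, generate one user-style query per chunk, then score an embedding model by whether the source chunk comes back in the top-k results (Recall@k). This is not Chroma's implementation: `filter_relevant` and `generate_query` are hypothetical placeholders standing in for LLM calls, and only the basic chromadb client API is used.

```python
# Minimal sketch (not Chroma's implementation) of the two-step loop:
#   1) filter chunks to content users would actually ask about
#   2) generate a user-style query per chunk
# then measure Recall@k: is the source chunk retrieved for its own query?
# `filter_relevant` and `generate_query` are hypothetical placeholders for LLM calls.
import chromadb


def filter_relevant(chunk: str) -> bool:
    # Placeholder: in practice an LLM judges whether the chunk contains
    # support-worthy content (dropping boilerplate, navigation text, etc.).
    return len(chunk.split()) > 5


def generate_query(chunk: str) -> str:
    # Placeholder: in practice an LLM writes a short, possibly ambiguous
    # query in the style of real production users.
    return "how do i " + " ".join(chunk.lower().split()[:6])


def recall_at_k(chunks: list[str], k: int = 5) -> float:
    kept = [c for c in chunks if filter_relevant(c)]        # step 1: filter
    ids = [f"chunk-{i}" for i in range(len(kept))]
    queries = [generate_query(c) for c in kept]             # step 2: generate

    client = chromadb.Client()                              # in-memory Chroma
    col = client.create_collection(name="generative-benchmark")
    col.add(ids=ids, documents=kept)                        # default embedding model

    hits = 0
    n = min(k, len(kept))
    for chunk_id, query in zip(ids, queries):
        result = col.query(query_texts=[query], n_results=n)
        if chunk_id in result["ids"][0]:                    # source chunk retrieved?
            hits += 1
    return hits / max(len(kept), 1)


if __name__ == "__main__":
    docs = [
        "To log a metric in Weights & Biases, call wandb.log with a dict of values.",
        "Set the WANDB_API_KEY environment variable before running wandb login.",
        "Copyright notice and footer links.",
    ]
    print(f"Recall@5: {recall_at_k(docs):.2f}")
```

In practice, both placeholders would be carefully prompted (and human-aligned) LLM calls, and the corpus and chunking would mirror what the production RAG system actually indexes, so the resulting benchmark reflects real user behavior rather than public-benchmark query style.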