Today, we're joined by Shreya Shankar, a PhD student at UC Berkeley, to discuss DocETL - https://www.docetl.com/, a declarative system for building and optimizing LLM-powered data processing pipelines for large-scale, complex document analysis tasks. We explore how DocETL's optimizer architecture works, the intricacies of building agentic systems for data processing, the current landscape of benchmarks for data processing tasks and how they differ from reasoning-based benchmarks, and the need for robust evaluation methods for human-in-the-loop LLM workflows. Shreya also shares real-world applications of DocETL and discusses the importance of effective validation prompts and of building robust, fault-tolerant agentic systems. Lastly, we cover the need for benchmarks tailored to LLM-powered data processing tasks and future directions for DocETL.
🎧 / 🎥 Listen or watch the full episode on our page: https://twimlai.com/go/703.
🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1
🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Follow us on LinkedIn: https://www.linkedin.com/company/twimlai/
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/
📖 CHAPTERS
===============================
00:00 - Introduction
4:57 - Challenges in AI interface design
9:02 - DocETL
14:13 - Data connector challenges in ETL systems
15:04 - UI for document processing
17:49 - Model support
18:54 - Data extraction tasks
21:08 - Prompts and HITL
25:32 - Evaluation in data processing
31:31 - Agents and agentic systems
38:46 - Benchmarks for data processing
43:44 - State-space models or long-context LLMs
44:02 - Future directions
🔗 LINKS & RESOURCES
===============================
DocETL - https://www.docetl.com/
Reimagining LLM-Powered Unstructured Data Analysis with DocETL - https://data-people-group.github.io/blogs/2024/09/24/docetl/
EPIC Data Lab - https://epic.berkeley.edu/