Benchmarking LLMs: Metrics, Challenges, and Best Practices for Evaluation - DevConf.IN 2025
Speaker(s): Ravindra Patil
---
LLMs have proven highly useful and hold great potential for enterprises. However, evaluating these models remains a complex challenge and is one of the reasons LLMs are not adopted directly.
Responsible and ethical AI will be key for enterprises to adopt LLMs for their business needs.
Traditional metrics like perplexity or BLEU score often fail to capture the nuanced capabilities of LLMs in real-world applications.
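As a hedged illustration (not part of the talk materials), the sketch below uses NLTK's sentence_bleu to show how an n-gram overlap metric can assign a near-zero score to a paraphrase that preserves the reference's meaning; the example sentences are made up for this sketch.

```python
# Minimal sketch: BLEU penalizes a faithful paraphrase because it shares
# few n-grams with the reference, illustrating why surface-overlap metrics
# can miss semantic equivalence.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]
paraphrase = ["a", "feline", "was", "sitting", "on", "a", "rug"]  # same meaning, different words

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], paraphrase, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # close to zero despite the semantic overlap
```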
This talk covers current best practices in benchmarking LLMs, the limitations of existing approaches, and emerging evaluation techniques.
We’ll explore a range of qualitative and quantitative metrics,
from task-specific benchmarks (e.g., code generation, summarization; a minimal example of such a check is sketched below)
to user-centric evaluations (e.g., coherence, creativity, bias detection),
and discuss the importance of specialized benchmarks that test LLMs on ethical and explainability grounds.
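For readers unfamiliar with task-specific benchmarks, here is a hedged sketch of a pass@1-style check for code generation: the candidate solution is executed against unit tests and scored 1 or 0. The `generate_code` function is a hypothetical stand-in for whatever model call is being evaluated, not an API from the talk.

```python
# Hedged sketch of a task-specific benchmark: score generated code by
# whether it passes unit tests (a simplified pass@1-style check).
def generate_code(prompt: str) -> str:
    # Hypothetical placeholder: in practice this would call the model under evaluation.
    return "def add(a, b):\n    return a + b"

def passes_tests(code: str, tests: list[tuple]) -> bool:
    namespace: dict = {}
    exec(code, namespace)  # run the candidate solution in an isolated namespace
    fn = namespace["add"]
    return all(fn(*args) == expected for args, expected in tests)

tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
candidate = generate_code("Write a Python function add(a, b) that returns their sum.")
print("pass@1:", 1.0 if passes_tests(candidate, tests) else 0.0)
```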
Outcome: The audience will understand how to choose LLMs with the right balance of accuracy, efficiency, and fairness, and will learn what has improved in Granite 3.0 that makes it a better LLM.
---
Slides and other resources:
https://pretalx.devconf.info/devconf-in-2025/talk/9EP8YM/