Moody's Develops LLM-Based Judge for Automating Search Relevance Evaluation in Financial Research
Key Takeaways
- Moody's automated search relevance evaluation system using LLM judges achieves 80%+ agreement with human domain experts while reducing evaluation time from days to minutes
- The framework enables scalable, cost-effective quality assurance for RAG systems in financial research by replacing expensive manual expert evaluation with automated LLM-based assessment
- The solution uses iterative prompt engineering, few-shot learning, and standard IR metrics (precision, recall, nDCG) to maintain high correlation with expert judgments while accelerating product iteration cycles
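Of the metrics named above, nDCG is the least self-explanatory: it rewards rankings that place highly relevant documents near the top, discounting relevance by rank position. A minimal sketch of the standard formula (the graded labels below are illustrative, not Moody's data):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each document's relevance grade is
    # discounted by log2(rank + 1), so top-ranked hits count most.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # yielding a score in [0, 1].
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels (3 = highly relevant ... 0 = irrelevant)
# as a judge might assign to the top five retrieved documents.
print(round(ndcg([3, 2, 3, 0, 1]), 3))  # → 0.972
```

Because the labels here come from the LLM judge rather than a human, the metric's usefulness rests on the judge's agreement with expert graders.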
Summary
Moody's has developed an automated relevance evaluation framework that uses large language models as judges to assess the quality of semantic search results across millions of financial research documents. The system, designed to evaluate context retrieved for Moody's Research Assistant—a retrieval-augmented generation (RAG) application—achieves over 80% agreement with domain expert evaluators while dramatically reducing evaluation time and costs.
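The headline "80%+ agreement" figure is, in its simplest form, the fraction of query-document judgments where the LLM judge matches the human expert. A minimal sketch with made-up binary labels (the article does not specify the exact agreement statistic used):

```python
def agreement_rate(judge_labels, expert_labels):
    # Fraction of items on which the LLM judge and the human expert
    # assign the same relevance label.
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(judge_labels)

# Hypothetical relevant (1) / not-relevant (0) labels for ten pairs.
judge  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
expert = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
print(agreement_rate(judge, expert))  # → 0.8
```

In practice a chance-corrected statistic such as Cohen's kappa is often reported alongside raw agreement, since raw agreement can be inflated when one label dominates.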
The framework employs iterative prompt tuning, few-shot learning, and explicit evaluation criteria to automatically assess search relevance using standard information retrieval metrics including precision, recall, and normalized discounted cumulative gain (nDCG). By automating what was previously a manual, expert-driven process, the system cuts experiment iteration time from days to minutes, enabling rapid testing and refinement of search algorithms.
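The combination of few-shot learning and explicit evaluation criteria described above typically means the judge prompt embeds a grading rubric plus worked examples. A minimal sketch of how such a prompt might be assembled; the rubric wording, grade scale, and example pairs are illustrative assumptions, not Moody's actual prompt:

```python
# Few-shot examples pairing a query, a retrieved passage, and the
# grade an expert would assign (hypothetical content).
FEW_SHOT_EXAMPLES = [
    {"query": "impact of rising rates on bank margins",
     "passage": "Higher policy rates widened net interest margins at large US banks in 2023.",
     "grade": 3},
    {"query": "impact of rising rates on bank margins",
     "passage": "The company announced a new headquarters in Austin.",
     "grade": 0},
]

# Explicit grading criteria, stated up front so the judge applies a
# consistent scale across documents.
RUBRIC = (
    "Grade the passage's relevance to the query on a 0-3 scale:\n"
    "3 = directly answers the query, 2 = partially relevant,\n"
    "1 = topically related only, 0 = irrelevant.\n"
    "Respond with a single integer."
)

def build_judge_prompt(query: str, passage: str) -> str:
    # Rubric first, then worked examples, then the item to grade.
    parts = [RUBRIC]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Query: {ex['query']}\nPassage: {ex['passage']}\nGrade: {ex['grade']}")
    parts.append(f"Query: {query}\nPassage: {passage}\nGrade:")
    return "\n\n".join(parts)

prompt = build_judge_prompt(
    "default risk outlook for speculative-grade issuers",
    "Moody's expects the speculative-grade default rate to decline next year.",
)
print(prompt.endswith("Grade:"))  # → True
```

Iterative prompt tuning then amounts to revising the rubric and swapping few-shot examples until the judge's grades track expert grades on a held-out labeled set.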
While the LLM-based evaluation system performs strongly on general financial content, Moody's acknowledges current limitations with highly technical financial concepts and is pursuing improvements through enhanced prompt engineering and the integration of expert feedback. The approach offers a scalable, cost-effective alternative to traditional evaluation methods, which is critical for financial research applications where data misinterpretation carries significant consequences.
Editorial Opinion
This development showcases a pragmatic approach to a real industry challenge: the cost and time constraints of maintaining quality in retrieval systems at scale. By achieving 80% agreement with experts, Moody's has found a practical sweet spot where automation delivers substantial efficiency gains while acknowledging its limitations. However, the remaining 20% gap—particularly with specialized financial concepts—underscores why domain-specific fine-tuning and human-in-the-loop validation remain essential in high-stakes applications where accuracy directly impacts financial decisions.