Can LLMs Be Trusted for Data Analysis? New Research Says Not Yet
Key Takeaways
- ▸ GPT-5.4 with extra-high reasoning is the most reliable LLM tested for EDA tasks, but even it achieves only moderate Business utility (0.6952)
- ▸ Most LLM configurations are not trustworthy for autonomous exploratory data analysis despite acceptable average performance scores
- ▸ Consistency and repeatability are co-equal dimensions of operational trustworthiness alongside raw accuracy
- ▸ A new "Business utility" metric reveals that output variability directly discounts operational value and raises deployment risk
Summary
A new academic paper evaluates whether large language models can reliably serve as exploratory data analysis (EDA) agents in business workflows. Researchers benchmarked 15 model variants from eight families on a supply chain simulation in which the models had to identify quality and sales issues. They found that most configurations lack the consistency and repeatability needed for autonomous use, even when their average performance appears acceptable.
GPT-5.4 with extra-high reasoning emerged as the top performer with a Business utility score of 0.6952, substantially outperforming other configurations. The researchers introduced a new evaluation framework called "Business utility" that combines mean performance, coefficient of variation (a measure of consistency), and cross-condition sensitivity to measure operational trustworthiness. The key finding challenges conventional wisdom: variability matters as much as accuracy. A model with high average scores but inconsistent outputs poses significant operational risk in production environments where reliability is essential.
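To make the idea behind the Business utility metric concrete, here is a minimal Python sketch of how variability can discount an otherwise similar average score. The coefficient of variation is the standard definition (standard deviation divided by mean). Everything else is an assumption made for illustration: the `toy_utility` function, its multiplicative discount, the use of the range of per-condition means as "sensitivity", and the sample scores are all hypothetical and are not the paper's formula or data.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """Population standard deviation divided by the mean (lower = more consistent)."""
    m = mean(scores)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(scores) / m


def toy_utility(scores_by_condition):
    """Hypothetical utility score, NOT the paper's formula.

    Assumes scores lie in [0, 1]. Discounts mean performance by:
      - the average within-condition coefficient of variation, and
      - the spread between per-condition means (a stand-in for
        cross-condition sensitivity).
    """
    runs = list(scores_by_condition.values())
    overall_mean = mean(s for r in runs for s in r)
    avg_cv = mean(coefficient_of_variation(r) for r in runs)
    condition_means = [mean(r) for r in runs]
    sensitivity = max(condition_means) - min(condition_means)
    return overall_mean * max(0.0, 1 - avg_cv) * max(0.0, 1 - sensitivity)


# Made-up repeated-run scores under two hypothetical test conditions.
steady = {"clean": [0.70, 0.72, 0.71], "noisy": [0.69, 0.70, 0.71]}
erratic = {"clean": [0.95, 0.50, 0.70], "noisy": [0.90, 0.45, 0.75]}


def flat_mean(scores_by_condition):
    """Plain average over every run, ignoring variability."""
    return mean(s for r in scores_by_condition.values() for s in r)


for name, data in (("steady", steady), ("erratic", erratic)):
    print(f"{name}: mean={flat_mean(data):.4f} toy_utility={toy_utility(data):.4f}")
```

In this toy data the erratic configuration has a slightly higher plain average than the steady one, yet it scores lower once run-to-run variability is factored in, which is the kind of gap between average performance and operational trustworthiness the article describes.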
Editorial Opinion
This research challenges a widespread assumption in enterprise AI adoption: that LLMs with strong average performance are ready for autonomous deployment. By treating consistency as a hard requirement rather than a nice-to-have, the paper establishes a more realistic standard for production data analytics. While GPT-5.4's emergence as the clear leader is encouraging for OpenAI, the findings suggest that most other LLM configurations remain too unpredictable for high-stakes analytical work without substantial human oversight.



