Study Reveals LLMs Cannot Incorporate Evidence in Scientific Reasoning
Key Takeaways
- AI chatbots like ChatGPT, Gemini, and Grok fail to update predictions when shown contradictory evidence, even when they can perceive the evidence clearly
- A study of AI agents on chemistry reasoning tasks found they ignored evidence in 68% of tasks and only successfully incorporated contradictory evidence 26% of the time
- Unlike human scientists who revise hypotheses based on experimental results, AI agents refuse to change their reasoning even when presented with clear proof their initial approach is wrong
Summary
A new research study has exposed a critical flaw in how large language models approach scientific reasoning: they cannot effectively incorporate new evidence when reasoning through problems. YouTuber FatherPhi demonstrated this by showing ChatGPT, Gemini, and Grok a video of a pen experiment that contradicted their initial predictions. The chatbots nevertheless insisted that their incorrect predictions were correct and did not update their reasoning based on the visual evidence provided.
Researchers conducted a rigorous test of AI agents on chemistry lab reasoning tasks, revealing even more alarming results. In 68% of 619 scientific reasoning tasks, the agents ignored evidence at least once. They made claims without supporting evidence in 53% of tasks, and only successfully used contradictory evidence to change their output in 26% of cases. This stands in stark contrast to how human scientists work: through an iterative process of hypothesis, experimentation, evidence review, and revision.
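To put the percentages above in concrete terms, the short sketch below converts them into approximate task counts. The counts are estimates derived from the rounded percentages and the 619-task total quoted in this summary, not figures reported by the study, and the helper name `approx_task_count` is made up for illustration. The 26% figure is left out because the summary does not say which set of cases it is measured against.

```python
def approx_task_count(percent: float, total_tasks: int) -> int:
    """Convert a rounded percentage into an approximate number of tasks."""
    return round(total_tasks * percent / 100)


TOTAL_TASKS = 619  # total scientific reasoning tasks quoted in the summary

# Both percentages are described as shares of the 619 tasks.
ignored_evidence = approx_task_count(68, TOTAL_TASKS)
unsupported_claims = approx_task_count(53, TOTAL_TASKS)

print(f"Ignored evidence at least once: ~{ignored_evidence} of {TOTAL_TASKS} tasks")
print(f"Made unsupported claims: ~{unsupported_claims} of {TOTAL_TASKS} tasks")
```

Under these assumptions, roughly 421 tasks involved ignored evidence and roughly 328 involved unsupported claims, so the two groups must overlap substantially.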
The implications are profound for scientific and medical applications. Unlike human scientists, who revise their ideas when confronted with contradictory data, AI agents stubbornly maintain incorrect hypotheses even in the face of clear evidence. Researchers argue that in domains like science, where process matters as much as results, this inability to genuinely incorporate new information raises serious questions about whether current LLM-based systems can be trusted.
This fundamental limitation threatens the trustworthiness of AI systems in high-stakes domains like science, medicine, and research.
Editorial Opinion
These findings expose a critical gap between LLM performance on static benchmarks and their ability to reason dynamically like scientists. While chatbots may pass knowledge tests, they fail at the iterative evidence-incorporation process that defines the scientific method. For AI to be genuinely useful in research and medicine, systems must be fundamentally redesigned to update their reasoning in real time, not merely pattern-match against training data. Until this core limitation is addressed, deploying these systems in scientific domains remains risky.



