BotBeat

Independent Research · RESEARCH · 2026-02-28

Researcher Discovers 'Few-Shot Collapse': Adding Examples Can Degrade LLM Performance by 50%

Key Takeaways

  • Testing revealed three failure patterns: peak regression (64% → 33% accuracy), ranking reversal (zero-shot leaders falling behind), and selection-method sensitivity (50%+ → 35% collapse)
  • The phenomenon affects multiple LLMs systematically, not just edge cases, and aligns with recent academic research on 'over-prompting' and 'context rot'
  • AdaptGauge provides automatic detection and classification of few-shot collapse patterns across different shot counts (0, 1, 2, 4, 8 examples)
Source: Hacker News (https://github.com/ShuntaroOkuma/adapt-gauge-core)

Summary

Independent researcher Shuntaro Okuma has released AdaptGauge, an open-source tool that reveals a counterintuitive phenomenon in large language models: adding more in-context examples can actively harm performance. Testing 8 LLMs across 4 tasks with varying numbers of few-shot examples (0, 1, 2, 4, and 8 shots), Okuma documented three distinct failure patterns. The most dramatic was 'peak regression,' where Google's Gemini Flash achieved 64% accuracy with 4 examples, then plummeted to 33% when given 8 examples, a nearly 50% relative performance collapse.

The research identified two additional concerning patterns: 'ranking reversal,' where models that performed best with zero examples fell to third place once examples were added, and severe sensitivity to example selection methods. When switching from hand-picked examples to TF-IDF-based selection, one model's performance crashed from over 50% to 35%. These findings align with emerging research on 'over-prompting' (Tang et al. 2025) and 'context rot' documented by Chroma Research, suggesting this is a systemic issue rather than isolated edge cases.
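The TF-IDF-based selection mentioned above typically means choosing the pool examples lexically closest to the query. AdaptGauge's actual selection code is not shown here; the following is a minimal, self-contained sketch of that general technique, with all function names and the example pool being illustrative assumptions.

```python
# Hypothetical sketch of TF-IDF-based few-shot example selection.
# Function names and the example pool are illustrative, not AdaptGauge's code.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors (dicts of token -> weight) for each doc."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency per token
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse TF-IDF vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(query, pool, k=4):
    """Pick the k pool examples most similar to the query under TF-IDF."""
    vecs = tfidf_vectors([query] + pool)    # row 0 is the query
    sims = [cosine(vecs[0], v) for v in vecs[1:]]
    ranked = sorted(range(len(pool)), key=lambda i: sims[i], reverse=True)
    return [pool[i] for i in ranked[:k]]

pool = [
    "Translate 'bonjour' to English.",
    "Summarize this paragraph about climate.",
    "Translate 'gracias' to English.",
    "Classify the sentiment of this review.",
]
print(select_examples("Translate 'danke' to English.", pool, k=2))
```

The point of the finding is that swapping curated examples for a purely lexical selector like this can silently change which examples land in the prompt, and with them the model's accuracy.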

AdaptGauge addresses a critical blind spot in LLM evaluation: standard benchmarks measure accuracy at a single point, missing how models behave as more context is added. The tool automatically tracks learning curves across different shot counts and flags collapse patterns, classifying them as immediate, gradual, or peak regression. Released under MIT license with pre-computed demo results, AdaptGauge enables developers to test models without API keys. The research challenges the common assumption that more examples always improve performance and raises questions about production deployment strategies that rely heavily on few-shot prompting.
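The curve-classification step described above can be sketched in a few lines. The thresholds and function names below are assumptions for illustration, not AdaptGauge's actual heuristics; the Gemini Flash numbers are taken from the article.

```python
# Hypothetical sketch of few-shot collapse classification over a learning
# curve (one accuracy value per shot count). Thresholds are illustrative.

SHOT_COUNTS = [0, 1, 2, 4, 8]

def classify_collapse(accuracies, drop_threshold=0.10):
    """Classify a learning curve as 'immediate', 'gradual',
    'peak_regression', or 'none'."""
    zero_shot = accuracies[0]
    peak_idx = max(range(len(accuracies)), key=lambda i: accuracies[i])
    final = accuracies[-1]

    # Immediate: accuracy drops as soon as any examples are added.
    if peak_idx == 0 and accuracies[1] < zero_shot - drop_threshold:
        return "immediate"
    # Peak regression: interior peak followed by a sharp drop.
    if 0 < peak_idx < len(accuracies) - 1 and accuracies[peak_idx] - final > drop_threshold:
        return "peak_regression"
    # Gradual: zero-shot is best and accuracy erodes toward max shots.
    if peak_idx == 0 and zero_shot - final > drop_threshold:
        return "gradual"
    return "none"

# The Gemini Flash curve peaks at 4 shots (64%) then collapses at 8 (33%);
# intermediate values here are made up to complete the curve.
print(classify_collapse([0.40, 0.50, 0.58, 0.64, 0.33]))  # peak_regression
```

A real implementation would also need to account for run-to-run noise, for instance by averaging over several seeds before comparing points on the curve.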

  • Standard LLM benchmarks miss this issue by testing only at single points, while production systems often rely on few-shot prompting that may unknowingly degrade performance

Editorial Opinion

This research exposes a troubling gap between how we evaluate LLMs and how we use them in production. The industry has largely assumed that providing more examples improves performance—a reasonable assumption that turns out to be dangerously wrong for certain model-task combinations. The 50% performance drops documented here aren't minor degradations; they could mean the difference between a functional production system and a broken one. Perhaps most concerning is that leaderboard rankings reverse based on shot count, meaning teams may be selecting models based on benchmarks that don't reflect their actual use case.

Large Language Models (LLMs) · Machine Learning · MLOps & Infrastructure · AI Safety & Alignment · Open Source


© 2026 BotBeat