BotBeat

Google / Alphabet
RESEARCH · 2026-04-15

Data Pruning Strategy Enables LLMs to Memorize 1.3X More Facts With Smaller Models

Key Takeaways

  • Data pruning and frequency flattening can improve LLM fact memorization by reducing information overload within model capacity constraints
  • A 110M-parameter model trained with the proposed data selection method matched the factual accuracy of a 1.3B-parameter model under standard training
  • The research provides an information-theoretic framework for understanding fact memorization and its relationship to model capacity and training data distribution
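The "frequency flattening" mentioned above can be sketched minimally. This is an illustration rather than the paper's algorithm: it assumes each training example can be mapped to a fact identifier (the `fact_key` function and the `cap` threshold are hypothetical), and simply caps how often any one fact appears:

```python
from collections import Counter

def flatten_frequencies(examples, fact_key, cap):
    """Downsample examples so no single fact appears more than
    `cap` times, flattening a skewed fact-frequency distribution."""
    seen = Counter()
    kept = []
    for ex in examples:
        k = fact_key(ex)
        if seen[k] < cap:      # this fact is still under its budget
            seen[k] += 1
            kept.append(ex)
    return kept

# Toy corpus where one fact dominates (hypothetical data).
corpus = ["paris-capital"] * 8 + ["ulm-einstein"] * 2 + ["k2-height"]
flat = flatten_frequencies(corpus, fact_key=lambda e: e, cap=2)
# Each fact now appears at most twice.
```

Capping repeats frees training tokens (and model capacity) for the long tail of rarely mentioned facts.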
Source: Hacker News (https://machinelearning.apple.com/research/cram-less)

Summary

Researchers at Google have published a paper demonstrating that strategic training data pruning can significantly improve how well large language models memorize factual knowledge. The study, accepted at ICLR 2026's Workshop on Navigating and Addressing Data Problems for Foundation Models, formalizes fact memorization from an information-theoretic perspective and identifies how training data distributions affect factual accuracy. The researchers found that fact accuracy becomes suboptimal when the amount of information in training data exceeds model capacity, particularly when facts follow a skewed frequency distribution like a power law.
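The power-law skew the researchers describe can be made concrete with a toy Zipf distribution; the fact counts and exponent below are illustrative, not figures from the paper:

```python
def zipf_counts(n_facts, total_mentions, s=1.0):
    """Hypothetical power-law fact frequencies: the i-th most
    popular fact is mentioned in proportion to 1 / i**s."""
    weights = [1.0 / (i ** s) for i in range(1, n_facts + 1)]
    z = sum(weights)
    return [total_mentions * w / z for w in weights]

counts = zipf_counts(n_facts=10_000, total_mentions=1_000_000)
head_share = sum(counts[:100]) / 1_000_000
# The top 1% of facts absorb roughly half of all mentions, while
# tail facts are seen only a handful of times each: the kind of
# imbalance that wastes capacity unless the distribution is flattened.
```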

The team proposes data selection schemes based on training loss that reduce the number of facts in training data and flatten their frequency distribution. In experiments on semi-synthetic datasets, their method boosted fact accuracy to the model's theoretical capacity limit. Most notably, when a GPT2-Small model (110M parameters) was pretrained on an annotated Wikipedia corpus with their selection method, it memorized 1.3X more entity facts than under standard training, matching the performance of a 10X larger model (1.3B parameters) trained on the full dataset. This approach addresses a critical challenge for LLMs: hallucinations and weak performance on knowledge-intensive tasks caused by insufficient fact memorization.

  • Strategic training data curation could enable smaller, more efficient models to achieve knowledge-intensive performance comparable to much larger models
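The article does not spell out the loss-based selection rule, so the sketch below is one plausible reading rather than the paper's method: rank examples by their current training loss and keep a fixed fraction (both the ranking direction and the `keep_fraction` value are assumptions):

```python
def select_by_loss(examples, losses, keep_fraction=0.5):
    """Hypothetical loss-based selection: keep the lowest-loss
    fraction of examples, pruning facts the model is unlikely to
    fit within its capacity. The paper's actual rule may differ."""
    ranked = sorted(zip(losses, range(len(examples))))
    n_keep = max(1, int(len(examples) * keep_fraction))
    keep_idx = sorted(i for _, i in ranked[:n_keep])
    return [examples[i] for i in keep_idx]

subset = select_by_loss(["f1", "f2", "f3", "f4"],
                        losses=[0.2, 3.1, 0.9, 2.5],
                        keep_fraction=0.5)
# Keeps the two lowest-loss examples: ["f1", "f3"]
```

In practice such a filter would be combined with the frequency flattening the paper also describes, so that pruning removes both hard-to-fit facts and redundant repeats.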

Editorial Opinion

This research addresses a fundamental tension in modern LLMs: the gap between model capacity and the volume of factual information in training data. By showing that smarter data curation can match the performance of significantly larger models, the work has important implications for efficiency and sustainability in AI development. If these findings generalize beyond the tested scenarios, data pruning strategies could become a standard optimization technique, reducing computational requirements while improving factual accuracy, a win for both performance and environmental impact.

Large Language Models (LLMs) · Natural Language Processing (NLP) · Generative AI · Machine Learning · Data Science & Analytics

© 2026 BotBeat