RESEARCH | Apple | 2026-04-15

Data Pruning Strategy Enables LLMs to Memorize 1.3X More Facts With Smaller Models

Key Takeaways

  • Data pruning and frequency flattening can improve LLM fact memorization by keeping the information in the training data within the model's capacity
  • A 110M-parameter model trained with the proposed data selection method matched the factual accuracy of a 1.3B-parameter model trained with standard training on the full dataset
  • The research provides an information-theoretic framework for understanding fact memorization and its relationship to model capacity and training data distribution
Source: Hacker News, https://machinelearning.apple.com/research/cram-less

Summary

Researchers at Apple have published a paper showing that strategically pruning training data can substantially improve how well large language models memorize factual knowledge. The study, accepted at the ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models, formalizes fact memorization from an information-theoretic perspective and identifies how the training data distribution affects factual accuracy. The researchers found that fact accuracy becomes suboptimal when the amount of information in the training data exceeds model capacity, and that the problem is worse when facts follow a skewed frequency distribution such as a power law.
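To make the skew concrete, here is a minimal Python sketch of a power-law (Zipf) frequency distribution over facts. It is an illustration only, not the paper's formalism: the function name `zipf_weights`, the 1,000-fact vocabulary, the exponent of 1.0, and the 10,000-sample corpus size are all assumptions chosen for the example.

```python
def zipf_weights(num_facts, exponent=1.0):
    """Normalized power-law (Zipf) frequencies for facts ranked 1..num_facts."""
    raw = [1.0 / rank ** exponent for rank in range(1, num_facts + 1)]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical setup: 1,000 distinct facts, 10,000 fact mentions in the corpus.
weights = zipf_weights(num_facts=1000)

# Share of all fact mentions taken up by the 10 most frequent facts.
head_share = sum(weights[:10])

# Expected number of times the rarest fact appears in the corpus.
tail_expected = 10_000 * weights[-1]

print(f"top-10 facts account for {head_share:.1%} of mentions")
print(f"rarest fact is expected to appear {tail_expected:.2f} times")
```

Under these assumed numbers, a handful of head facts absorbs a large share of the training signal while tail facts are seen only about once, which is the kind of imbalance the summary describes.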

The team proposes data selection schemes based on training loss that reduce the number of facts in the training data and flatten their frequency distribution. In experiments on semi-synthetic datasets, the method boosted fact accuracy to the model's theoretical capacity limit. Most notably, when a GPT2-Small model (110M parameters) was pretrained on an annotated Wikipedia corpus using the selection method, it memorized 1.3X more entity facts than with standard training, matching the performance of a roughly 10X larger model (1.3B parameters) pretrained on the full dataset. The approach targets a critical weakness of LLMs: hallucinations and poor performance on knowledge-intensive tasks caused by insufficient fact memorization.
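The two effects described above, fewer distinct facts and a flatter frequency distribution, can be sketched in a few lines of Python. This is a simplified stand-in, not the paper's method: the paper selects data using training loss, whereas this sketch assumes fact occurrence counts are known directly. The function `flatten_and_prune` and its parameters `max_facts` and `max_repeats` are hypothetical names introduced for illustration.

```python
from collections import Counter


def flatten_and_prune(fact_ids, max_facts, max_repeats):
    """Keep at most `max_facts` distinct facts, each at most `max_repeats` times.

    Assumption: selection uses raw occurrence counts, not training loss.
    """
    counts = Counter(fact_ids)
    # Prune: retain only the most frequent facts, up to the assumed budget.
    kept = {fact for fact, _ in counts.most_common(max_facts)}
    # Flatten: cap how often each retained fact may appear.
    seen = Counter()
    selected = []
    for fact in fact_ids:
        if fact in kept and seen[fact] < max_repeats:
            seen[fact] += 1
            selected.append(fact)
    return selected


# Toy skewed corpus: each fact appears half as often as the previous one.
corpus = ["a"] * 8 + ["b"] * 4 + ["c"] * 2 + ["d"]
pruned = flatten_and_prune(corpus, max_facts=3, max_repeats=2)
print(Counter(pruned))  # Counter({'a': 2, 'b': 2, 'c': 2})
```

In the toy corpus, the rarest fact is dropped and the remaining three end up equally frequent, so the retained data carries fewer facts with a uniform distribution.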

In short, strategic training data curation could enable smaller, more efficient models to achieve knowledge-intensive performance comparable to much larger models.

Editorial Opinion

This research addresses a fundamental tension in modern LLMs: the gap between model capacity and the volume of factual information in training data. By showing that smarter data curation lets a much smaller model match a larger one on fact memorization, the work has important implications for efficiency and sustainability in AI development. If these findings generalize beyond the tested scenarios, data pruning could become a standard optimization technique that reduces computational requirements while improving factual accuracy, a win for both performance and environmental impact.

