Data Pruning Strategy Enables LLMs to Memorize 1.3X More Facts With Smaller Models
Key Takeaways
- Data pruning and frequency flattening can improve LLM fact memorization by reducing information overload within model capacity constraints
- A 110M-parameter model trained with the proposed data selection method matched the factual accuracy of a 1.3B-parameter model trained with standard methods
- The research provides an information-theoretic framework for understanding fact memorization and its relationship to model capacity and training data distribution
Summary
Researchers at Google have published a paper demonstrating that strategic training data pruning can significantly improve how well large language models memorize factual knowledge. The study, accepted at ICLR 2026's Workshop on Navigating and Addressing Data Problems for Foundation Models, formalizes fact memorization from an information-theoretic perspective and identifies how training data distributions affect factual accuracy. The researchers found that fact accuracy becomes suboptimal when the amount of information in training data exceeds model capacity, particularly when facts follow a skewed frequency distribution like a power law.
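To make the skew concrete, here is a toy illustration (the numbers and the `zipf_frequencies` helper are illustrative assumptions, not from the paper) of how a Zipfian fact-frequency distribution lets a handful of head facts consume a large share of the training budget, leaving tail facts under-represented relative to a fixed model capacity:

```python
def zipf_frequencies(n_facts, s=1.0):
    """Normalized Zipf (power-law) frequencies for facts ranked 1..n_facts."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_facts + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# With 1,000 facts under a Zipf(s=1) distribution, the top 10 facts alone
# absorb roughly 40% of all training occurrences.
freqs = zipf_frequencies(1000)
head_share = sum(freqs[:10])
```

Flattening this distribution spends the model's limited capacity more evenly across facts instead of repeatedly re-encoding the head.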
The team proposes data selection schemes based on training loss that reduce the number of facts in training data and flatten their frequency distribution. In experiments on semi-synthetic datasets, their method boosted fact accuracy to the model's theoretical capacity limit. Most notably, when pretraining a GPT2-Small model (110M parameters) on an annotated Wikipedia corpus using their selection method, the smaller model memorized 1.3X as many entity facts as it did under standard training, matching the performance of a 10X larger model (1.3B parameters) trained on the full dataset. This approach addresses a critical challenge in LLMs: hallucinations and poor performance on knowledge-intensive tasks caused by insufficient fact memorization.
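The selection scheme described above could be sketched as follows (a minimal illustration, not the paper's implementation: the function name, the loss threshold, and the per-fact cap are all assumptions):

```python
import random
from collections import Counter

def select_training_data(examples, losses, loss_threshold, max_per_fact, seed=0):
    """Sketch of loss-based pruning plus frequency flattening.

    examples: list of (fact_id, text) training examples
    losses: per-example training loss from a reference model
    loss_threshold: drop examples whose loss exceeds this, pruning facts
        the model is unlikely to fit within its capacity
    max_per_fact: cap on occurrences per fact, flattening a skewed
        (e.g. power-law) frequency distribution toward uniform
    """
    rng = random.Random(seed)
    # 1) Prune: keep only examples the model can plausibly memorize.
    kept = [ex for ex, loss in zip(examples, losses) if loss <= loss_threshold]
    # 2) Flatten: shuffle, then cap how often each fact appears.
    rng.shuffle(kept)
    counts = Counter()
    selected = []
    for fact_id, text in kept:
        if counts[fact_id] < max_per_fact:
            counts[fact_id] += 1
            selected.append((fact_id, text))
    return selected
```

For instance, applied to a corpus where one fact appears ten times and another only once with high loss, the cap would downsample the frequent fact while the threshold would drop the unlearnable one.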
The upshot: strategic training data curation could enable smaller, more efficient models to achieve knowledge-intensive performance comparable to much larger models.
Editorial Opinion
This research addresses a fundamental tension in modern LLMs: the gap between model capacity and the volume of factual information in training data. By showing that smarter data curation can achieve the same factual performance with significantly smaller models, the work has important implications for efficiency and sustainability in AI development. If these findings generalize beyond the tested scenarios, data pruning strategies could become a standard optimization technique, reducing computational requirements while improving factual accuracy, a win-win for both performance and environmental impact.


