Wikimedia Releases Massive Structured Wikipedia Dataset on Hugging Face
Key Takeaways
- 10.4 million structured Wikipedia articles (English & French) now available as a machine-readable dataset on Hugging Face
- Dataset includes parsed infoboxes, sections, tables, references, images, Wikidata links, and credibility signals for each article
- 44.42 GB Parquet-formatted dataset optimized for model pre-training, RAG, information extraction, and knowledge base applications
Summary
The Wikimedia Foundation has released structured-Wikipedia, a large-scale, machine-readable dataset of English and French Wikipedia articles, now available on Hugging Face. The dataset contains over 10.4 million articles (7.6M English, 2.87M French) totaling 44.42 GB in Parquet format, with metadata last updated in May 2026. Each article includes parsed infoboxes, sections, tables, references with credibility signals, images, Wikidata entity links, and editor revision history, transforming Wikipedia's unstructured content into a consistently schematized format optimized for machine learning.
The structured dataset is designed to support AI practitioners in model pre-training, retrieval-augmented generation (RAG), information extraction, entity linking, and knowledge base development. By exposing Wikipedia's rich semantic structure alongside credibility indicators, the release enables AI systems to draw on Wikipedia's knowledge while accounting for citation gaps and reference reliability. The dataset is distributed under the CC-BY-SA-4.0 license and can be accessed via the Hugging Face Datasets library for Python, Polars, DuckDB, or direct Parquet access through the Hugging Face Hub.
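To make the RAG use case concrete, here is a minimal sketch of how one structured record might be flattened into retrieval passages while keeping only sections that carry at least one reference. The field names (`name`, `sections`, `text`, `references`), the `Passage` type, and the sample record are hypothetical, chosen for illustration only; the actual schema should be checked against the dataset card before adapting this.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    """One retrieval unit: a section of an article plus its citation count."""
    title: str
    section: str
    text: str
    n_refs: int


def article_to_passages(article: dict, min_refs: int = 1) -> list[Passage]:
    """Flatten one article record into passages, skipping poorly cited sections.

    NOTE: the keys used here ("name", "sections", "text", "references") are
    assumptions for illustration, not the dataset's documented schema.
    """
    passages = []
    for section in article.get("sections", []):
        text = (section.get("text") or "").strip()
        refs = section.get("references") or []
        # Drop empty sections and sections with fewer than min_refs references.
        if not text or len(refs) < min_refs:
            continue
        passages.append(
            Passage(
                title=article.get("name", ""),
                section=section.get("name", ""),
                text=text,
                n_refs=len(refs),
            )
        )
    return passages


# Hypothetical sample record standing in for one row of the dataset.
sample_article = {
    "name": "Example Article",
    "sections": [
        {
            "name": "History",
            "text": "Founded in 1901.",
            "references": [{"url": "https://example.org/a"}],
        },
        {"name": "Trivia", "text": "An unsourced claim.", "references": []},
    ],
}

passages = article_to_passages(sample_article)
for p in passages:
    print(f"{p.title} / {p.section} ({p.n_refs} refs): {p.text}")
```

Raising `min_refs` trades coverage for better-sourced passages; a reference count is only a crude stand-in for the richer credibility signals the release describes.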
Editorial Opinion
This release represents a significant contribution to AI infrastructure by making Wikipedia's rich, structured knowledge accessible in machine-optimized formats. The inclusion of credibility signals and reference metadata is particularly valuable: it acknowledges that not all Wikipedia content is equally reliable and gives AI systems tools to assess knowledge quality. As generative AI systems increasingly rely on retrieved knowledge, structured access to Wikipedia with credibility indicators could help mitigate hallucinations and improve answer grounding.



