BotBeat

OPEN SOURCE | Wikimedia Foundation | 2026-05-26

Wikimedia Releases Massive Structured Wikipedia Dataset on Hugging Face

Key Takeaways

  • 10.4 million structured Wikipedia articles (English & French) now available as a machine-readable dataset on Hugging Face
  • Dataset includes parsed infoboxes, sections, tables, references, images, Wikidata links, and credibility signals for each article
  • 44.42 GB Parquet-formatted dataset optimized for model pre-training, RAG, information extraction, and knowledge base applications
Source: Hacker News (https://huggingface.co/datasets/wikimedia/structured-wikipedia)

Summary

The Wikimedia Foundation has released structured-wikipedia, a large-scale, machine-readable dataset of English and French Wikipedia articles, on Hugging Face. The dataset contains over 10.4 million articles (7.6M English, 2.87M French) totaling 44.42 GB in Parquet format, with metadata last updated in May 2026. Each article includes parsed infoboxes, sections, tables, references with credibility signals, images, Wikidata entity links, and editor revision history. The result turns Wikipedia's loosely structured content into a consistently schematized format suited to machine learning.

The dataset is designed to support model pre-training, retrieval-augmented generation (RAG), information extraction, entity linking, and knowledge base development. By exposing Wikipedia's semantic structure alongside credibility indicators, the release lets AI systems draw on Wikipedia's knowledge while accounting for citation gaps and reference reliability. The dataset is distributed under the CC-BY-SA-4.0 license and can be accessed through the Hugging Face Datasets library for Python, Polars, DuckDB, or direct Parquet access on the Hugging Face Hub.
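
As a rough illustration of the access path and the RAG use case described above, the sketch below streams articles with the Hugging Face `datasets` library and flattens one record into section-level passages. It is a minimal sketch under stated assumptions, not the dataset's documented usage: the subset name is left to the caller, the `"train"` split is assumed, and the field names (`name`, `url`, `sections`, `text`) are hypothetical placeholders that should be checked against the dataset card.

```python
from typing import Any, Dict, Iterator, List

# Repository id taken from the source URL cited in this article.
REPO_ID = "wikimedia/structured-wikipedia"


def stream_articles(config_name: str, split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily iterate over articles instead of downloading all 44.42 GB.

    `config_name` is deliberately left to the caller: look up the available
    subsets on the dataset card. `split="train"` is an assumption.
    """
    # Imported here so the rest of the sketch runs without the library installed.
    from datasets import load_dataset

    return iter(load_dataset(REPO_ID, config_name, split=split, streaming=True))


def article_to_chunks(article: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten one article into section-level passages for a retrieval index.

    The field names used here are hypothetical; adapt them to the real schema.
    """
    chunks: List[Dict[str, str]] = []
    for section in article.get("sections") or []:
        text = (section.get("text") or "").strip()
        if not text:
            continue  # skip sections that carry no prose
        chunks.append({
            "title": article.get("name", ""),
            "section": section.get("name", ""),
            "source": article.get("url", ""),
            "text": text,
        })
    return chunks


# Hand-written stand-in record, not real dataset output.
sample = {
    "name": "Example article",
    "url": "https://en.wikipedia.org/wiki/Example",
    "sections": [
        {"name": "Abstract", "text": "An example is a representative instance."},
        {"name": "See also", "text": ""},
    ],
}

chunks = article_to_chunks(sample)
print(len(chunks), chunks[0]["section"])  # prints: 1 Abstract
```

Keeping the source URL on every chunk lets a retrieval system cite the originating article, which also helps meet the attribution requirement of the CC-BY-SA-4.0 license.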

  • Includes metadata for understanding citation gaps and reference reliability, supporting higher-quality AI training data

Editorial Opinion

This release is a significant contribution to AI infrastructure: it makes Wikipedia's structured knowledge accessible in a machine-readable format. The inclusion of credibility signals and reference metadata is particularly valuable because it acknowledges that not all Wikipedia content is equally reliable and gives AI systems a way to assess knowledge quality. As generative AI systems increasingly rely on retrieved knowledge, structured access to Wikipedia with credibility indicators could help mitigate hallucinations and improve answer grounding.

Generative AI, Machine Learning, Data Science & Analytics, Open Source
