OPEN SOURCE · Wontopos · 2026-04-01

WMB-100K: New Open Benchmark Tests AI Memory Systems at Scale with 4.3M Tokens

Key Takeaways

  • WMB-100K is the largest open memory benchmark to date, with 4.3M tokens and 2,708 situational questions across multiple difficulty levels and question types
  • The benchmark isolates memory retrieval accuracy from LLM reasoning ability, measuring only whether memory systems return relevant information for real situations
  • Includes 400 false memory detection probes to penalize incorrect retrievals, addressing a critical production concern often overlooked in prior benchmarks
Source: Hacker News, https://github.com/Irina1920/WMB-100K

Summary

Wontopos has released WMB-100K, an open benchmark designed to evaluate AI memory systems at enterprise scale. The benchmark stores 4.3M tokens across 2.3M documents and 105K conversation turns, then tests whether memory systems can retrieve the right information for 2,708 real-world situational questions. Unlike previous benchmarks that measure LLM reasoning ability, WMB-100K isolates and measures only what memory systems should do: accurate retrieval and false memory defense.
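
To make that separation concrete, here is a minimal sketch of what a retrieval-only evaluation loop could look like. The MemorySystem interface, the Question fields, and the pass criterion are illustrative assumptions; the article does not describe the benchmark's actual harness.

# Hypothetical sketch of a retrieval-only evaluation loop; the
# MemorySystem interface and Question fields are assumptions for
# illustration, not WMB-100K's real API.
from dataclasses import dataclass

@dataclass
class Question:
    situation: str     # the real-world situational query
    relevant_ids: set  # ground-truth memory IDs that should be returned

class MemorySystem:
    def ingest(self, doc_id: str, text: str) -> None:
        raise NotImplementedError

    def retrieve(self, query: str, k: int = 5) -> list:
        raise NotImplementedError

def evaluate(system: MemorySystem, questions: list) -> float:
    """Score retrieval accuracy only; no LLM interprets the results."""
    hits = 0
    for q in questions:
        returned = set(system.retrieve(q.situation))
        # A question passes when the retrieved memories cover the
        # ground-truth set, regardless of downstream interpretation.
        if q.relevant_ids <= returned:
            hits += 1
    return hits / len(questions)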

The benchmark represents a significant evolution from Wontopos's earlier work, upgrading from simple fact lookup questions to situational reasoning that mirrors real production scenarios. It includes seven question types ranging from single-memory retrieval to complex multi-step reasoning chains, plus 400 adversarial false memory probes to penalize incorrect retrievals. The scoring system explicitly excludes LLM interpretation—memory systems pass when they return the right memories for the situation, regardless of how an LLM later interprets them.
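
As a sketch of how the adversarial probes might factor into scoring, assuming the correct response to a probe is to return nothing; the article does not specify the penalty formula, so the weighting below is an assumption.

# Hypothetical scoring with adversarial false-memory probes.
# Probes describe events that never happened; returning any memory
# for one counts against the system. The penalty weight is illustrative.
def score_with_probes(system, questions, probes, penalty=1.0):
    correct = sum(
        1 for q in questions
        if q.relevant_ids <= set(system.retrieve(q.situation))
    )
    # Correct probe behavior: retrieve nothing (or an explicit
    # "no relevant memory" signal), not a plausible-looking hit.
    false_hits = sum(
        1 for p in probes if system.retrieve(p.situation)
    )
    return (correct - penalty * false_hits) / len(questions)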

WMB-100K offers two evaluation modes: Quick Mode using GPT-4o-mini for self-testing (~$0.36), and Official Mode using three LLMs with majority voting for leaderboard submissions (~$1.16). This dual-mode approach prevents bias toward any single judge while keeping development costs accessible. The benchmark's design reflects a growing recognition that memory systems require distinct evaluation from language generation—they serve as information retrieval layers that feed LLMs, not as reasoning engines themselves.

  • Offers both Quick Mode (self-testing) and Official Mode (verified leaderboard) evaluation to balance accessibility and credibility
  • Represents industry recognition that memory systems are distinct components requiring specialized evaluation beyond traditional LLM benchmarks
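
Returning to the Official Mode judging described above, a minimal sketch of three-judge majority voting follows. The is_relevant() judge interface and the True/False protocol are assumptions; the article says only that three LLMs vote so that no single judge's bias decides the outcome.

# Hypothetical three-judge majority vote for Official Mode
# leaderboard scoring. The judge interface is an assumption.
from collections import Counter

def majority_judgment(judges, situation, retrieved):
    """Each judge votes True/False on whether the retrieved memories
    fit the situation; the majority of the three decides."""
    votes = [j.is_relevant(situation, retrieved) for j in judges]
    return Counter(votes).most_common(1)[0][0]

With three judges, a 2-1 split still yields a decision, which is what keeps any single model's idiosyncrasies from dominating the leaderboard.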

Editorial Opinion

WMB-100K fills a critical gap in AI evaluation infrastructure. As memory systems become essential components of production AI applications, isolating their performance from downstream LLM reasoning is both methodologically sound and practically valuable. The benchmark's explicit separation of concerns—pure retrieval accuracy versus interpretation—could accelerate development of specialized, high-performance memory systems. The false memory defense mechanism is particularly important, as hallucinated or irrelevant memory retrievals are costly production failures that many benchmarks ignore.

Machine Learning · Data Science & Analytics
