Exabase Achieves State-of-the-Art on Memory Benchmark Using Smaller, Cheaper Models
Key Takeaways
- Achieved state-of-the-art results on the LongMemEval benchmark with 96.4% accuracy at top-50 recall using Gemini 3 Flash rather than a frontier model
- Demonstrated that superior memory performance doesn't require expensive, oversized models, challenging the industry's scale-first approach
- Published a transparent methodology without question-specific prompt tuning, setting a standard for production-realistic evaluation in long-term memory research
Summary
Exabase announced Mneme-1 (M-1), its first-generation long-term memory engine, which achieved state-of-the-art results on LongMemEval, the most comprehensive public benchmark for conversational memory retrieval. The system reached 96.4% accuracy at top-50 recall depth using Gemini 3 Flash, a smaller and cheaper model, without question-specific prompt engineering or reliance on large frontier models.
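The headline metric, accuracy at a top-50 recall depth, is typically computed as recall@k: for each question, check whether the answer-bearing memory appears among the top k retrieved items, then average over the benchmark. A minimal sketch of that scoring (illustrative only; the function name, session IDs, and exact scoring details are assumptions, not Exabase's published code):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=50):
    """Fraction of relevant items found in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for r in relevant_ids if r in top_k)
    return hits / len(relevant_ids)

# Hypothetical per-question results; the benchmark score is the mean.
queries = [
    (["s3", "s1", "s9"], {"s1"}),  # answer-bearing session retrieved -> 1.0
    (["s2", "s7"], {"s4"}),        # missed -> 0.0
]
scores = [recall_at_k(r, rel, k=2) for r, rel in queries]
print(sum(scores) / len(scores))  # 0.5
```

A larger k makes retrieval easier but passes more context to the reader model, which is why hitting 96.4% at k=50 with a small model is the notable result here.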
The breakthrough addresses a critical gap in AI systems: long-term memory has been both poorly evaluated and rarely tested under production-realistic conditions. Long-term memory in AI systems mirrors human memory in that it is reconstructive, associative, and temporally sensitive rather than a simple database lookup. This capability is essential for building AI systems that maintain meaningful context across conversations and sessions.
Exabase's methodology emphasizes transparent, reproducible evaluation and acknowledges inherent ceiling effects in the benchmark itself. By refusing to rely on frontier models or prompt engineering tricks, the team demonstrates that progress on memory doesn't require brute-force scaling. This work sets a new standard for responsible evaluation in the long-term memory research space.
Editorial Opinion
Long-term memory remains one of the least solved problems in production AI, so Exabase's state-of-the-art results on a rigorous public benchmark are genuinely valuable. What makes this work stand out is the insistence on production-realistic conditions, with no frontier models and no question-specific prompt engineering, showing that better memory systems don't require brute-force scaling. This is the kind of unglamorous but essential research the field needs.