SzPredict: Open Benchmark Exposes Seizure Prediction Field's Generalization Crisis
Key Takeaways
- Methodological crisis: Across 19 recent papers on the same dataset, reported sensitivities range from 58% to 100% because of incompatible evaluation protocols, so most published results aren't directly comparable
- Patient-specific vs. generalizable: Nearly all published work trains and tests on the same patient, achieving impressive results that don't translate to clinical scenarios where models must work on unseen patients
- Honest baseline assessment: All six included baselines fail to meet clinically relevant thresholds under cross-patient evaluation, establishing a realistic research target
Summary
HyperReal has released SzPredict, an open-source benchmark for EEG-based seizure prediction that exposes a critical methodological crisis in the field. Most published seizure prediction results aren't actually comparable because researchers use incompatible task definitions, preictal windows, patient cohorts, and post-processing rules. Across a review of 19 recent papers using the same CHB-MIT database, reported sensitivities range from 58% to nearly 100%—not because models differ significantly, but because success is measured differently.
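To make the incomparability concrete, consider a toy illustration (not code from SzPredict or any of the surveyed papers): scoring the very same alarms against two different preictal-window definitions yields two different sensitivities, even though the underlying model output is identical.

```python
# Toy illustration: identical alarms, two preictal-window definitions,
# two different "sensitivity" numbers. All times are in minutes.

def sensitivity(seizure_onsets, alarms, preictal_min):
    """Fraction of seizures with at least one alarm inside the
    preictal window [onset - preictal_min, onset)."""
    hits = 0
    for onset in seizure_onsets:
        if any(onset - preictal_min <= a < onset for a in alarms):
            hits += 1
    return hits / len(seizure_onsets)

onsets = [100.0, 300.0, 500.0]   # seizure onset times
alarms = [55.0, 290.0, 480.0]    # model alarm times

print(sensitivity(onsets, alarms, preictal_min=30))  # 2/3: first alarm too early
print(sensitivity(onsets, alarms, preictal_min=60))  # 1.0: all three now "hits"
```

A paper using the 60-minute window would report perfect sensitivity on exactly the same predictions that score 67% under the 30-minute window, which is the kind of protocol divergence the 58%-to-100% spread reflects.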
The core problem: most published research reports patient-specific accuracy (training and testing on the same patient), routinely topping 95% while revealing nothing about real clinical utility, namely whether a model can predict seizures in new, unseen patients. SzPredict standardizes evaluation across four protocols, with Protocol 3 (cross-patient fixed split) as the primary benchmark for practical deployment and Protocol 4 measuring the clinically relevant metric: how many minutes of warning the model provides before seizure onset.
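The difference between the two evaluation regimes comes down to how data is split. The sketch below uses hypothetical record and split names, not SzPredict's actual API, to show why a cross-patient split is the harder, more clinically honest test.

```python
# Hedged sketch (record layout and names are hypothetical, not SzPredict's API):
# patient-specific vs. cross-patient evaluation splits.

# (patient_id, segment_id) stand-ins for EEG segments.
records = (
    [("chb01", s) for s in range(4)]
    + [("chb02", s) for s in range(4)]
    + [("chb03", s) for s in range(4)]
)

# Patient-specific: split each patient's own segments between train and test.
ps_train = [r for r in records if r[1] < 2]
ps_test = [r for r in records if r[1] >= 2]

# Cross-patient (Protocol-3-style fixed split): hold out whole patients.
held_out = {"chb03"}
cp_train = [r for r in records if r[0] not in held_out]
cp_test = [r for r in records if r[0] in held_out]

# Cross-patient: no test patient ever appears in training.
assert not {p for p, _ in cp_train} & {p for p, _ in cp_test}
# Patient-specific: every test patient was also trained on.
assert {p for p, _ in ps_test} <= {p for p, _ in ps_train}
```

A model evaluated patient-specifically can exploit each patient's idiosyncratic EEG signature; the cross-patient split removes that shortcut, which is why the same architectures score so much lower under Protocol 3.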
Notably, all six included baselines fail to achieve clinically acceptable performance on Protocol 3. This honest assessment resets field expectations and provides a realistic target for future work. The benchmark is MIT-licensed, includes the full CHB-MIT Scalp EEG Database (24 pediatric subjects, 844+ hours of continuous EEG, 198 annotated seizures), and can be set up in five minutes from git clone.
- Standardized protocols: SzPredict pins down four evaluation protocols, with Protocol 4 measuring clinical utility through 'time-to-seizure-warning' rather than abstract accuracy metrics
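As a rough sketch of what a time-to-seizure-warning metric might compute (the function and its window parameter are illustrative assumptions, not SzPredict's implementation): for each seizure, take the earliest alarm inside the preictal window and report how many minutes ahead of onset it fired.

```python
# Illustrative only (not SzPredict's implementation): advance-warning time
# per seizure, in minutes, based on the earliest valid alarm.

def warning_times(seizure_onsets, alarms, preictal_min=60):
    """For each seizure, minutes of warning from the earliest alarm in
    [onset - preictal_min, onset); None when no valid alarm fired."""
    out = []
    for onset in seizure_onsets:
        valid = [a for a in alarms if onset - preictal_min <= a < onset]
        out.append(onset - min(valid) if valid else None)
    return out

print(warning_times([100.0, 300.0], [55.0, 90.0, 295.0]))
# → [45.0, 5.0]: 45 minutes of warning before the first seizure, 5 before the second
```

Reporting minutes of warning, rather than a single accuracy percentage, directly answers the clinical question of whether a patient has time to act.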
Editorial Opinion
SzPredict fills a critical gap in AI validation: it moves seizure prediction from 'what accuracy looks good in a paper?' to 'will this actually work for a new patient?' The fact that all baselines fail is the benchmark's greatest strength—it's a clear signal that the field has been measuring the wrong things. This kind of honest, community-focused benchmarking is what accelerates real progress, especially in healthcare where claimed improvements must translate to clinical utility.