New Benchmark Method Reveals Proprietary LLM Parameter Counts Through Factual Knowledge Measurement
Key Takeaways
- IKPs provide the first principled method for estimating proprietary LLM parameter counts, grounded in fundamental information theory rather than inference-economics measurements, which are expensive and carry high uncertainty
- Factual capacity scales predictably and log-linearly with model size; saturation narratives look premature even as benchmarks plateau on reasoning tasks
- For Mixture-of-Experts models, total parameters, not active parameters, are the far stronger predictor of factual knowledge retention, offering new insight for MoE architecture optimization (see the sketch after this list)
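The MoE takeaway reduces to comparing two one-variable fits. Below is a minimal sketch of that comparison; the total/active parameter pairs loosely echo familiar published MoE configurations, but the IKP scores are invented placeholders, not the paper's data.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of an ordinary least-squares line of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - resid.dot(resid) / ((y - y.mean()) ** 2).sum()

# Hypothetical MoE calibration rows: (total params, active params, IKP score).
# Scores are made-up placeholders used only to illustrate the comparison.
rows = np.array([
    (46.7e9, 12.9e9, 0.52),
    (141e9,  39e9,   0.61),
    (389e9,  52e9,   0.68),
    (671e9,  37e9,   0.73),
    (1.6e12, 32e9,   0.80),
])
total, active, score = rows.T

print(f"R^2 vs log total params : {r_squared(np.log10(total), score):.2f}")
print(f"R^2 vs log active params: {r_squared(np.log10(active), score):.2f}")
```

Because active parameter counts barely grow across these rows while scores climb, the total-parameter fit dominates, mirroring the paper's reported 0.79 vs 0.51 gap.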
Summary
Researchers have developed Incompressible Knowledge Probes (IKPs), a novel method for estimating the parameter counts of closed-source LLMs without access to internal model details. The approach exploits an information-theoretic bound: storing F facts requires at least F × (bits per fact) / (bits per parameter) weights. By measuring what a model knows through 1,400 carefully calibrated factual questions spanning seven tiers of obscurity, researchers can reliably estimate model size.
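A minimal sketch of the counting argument follows. The constants `bits_per_fact` and `bits_per_param` are assumptions; this summary does not give the paper's actual calibration, and the ~2 bits-per-parameter figure is a placeholder borrowed from prior capacity studies.

```python
def min_parameters(facts_known: float,
                   bits_per_fact: float = 32.0,   # assumed entropy of one fact
                   bits_per_param: float = 2.0) -> float:  # assumed storable bits/weight
    """Information-theoretic lower bound on parameter count.

    Storing `facts_known` incompressible facts of `bits_per_fact` bits each
    takes facts_known * bits_per_fact bits in total; since each parameter
    encodes at most `bits_per_param` bits, the weight count is at least
    the ratio of the two.
    """
    return facts_known * bits_per_fact / bits_per_param

# Under these assumed constants, a model answering ~10M distinct obscure
# facts must carry at least 160M parameters.
print(f"{min_parameters(10e6):,.0f}")
```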
Validated on 89 open-weight models (135M–1.6T parameters) from 19 vendors, the method achieved R² = 0.917, and leave-one-out cross-validation placed 87.6% of estimates within 3× of the actual size. Applied to 188 proprietary models from 27 vendors, including all major frontier AI systems, IKPs provide the first vendor-independent parameter estimates. For Mixture-of-Experts models, total parameters proved a much stronger predictor of factual knowledge (R² = 0.79) than active parameters alone (R² = 0.51).
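A hedged sketch of the leave-one-out check: fit log10(params) against the knowledge score, hold each model out in turn, and count held-out predictions within 3× of truth, i.e. within log10 3 ≈ 0.48 in log space. The calibration data below are synthetic stand-ins, not the paper's 89-model scores.

```python
import numpy as np

def loo_within_factor(scores, log10_params, factor=3.0):
    """Leave-one-out CV for a log-linear size estimator.

    Refits log10(params) ~ score without each model, then counts how often
    the held-out prediction lands within `factor`x of the true size.
    """
    s = np.asarray(scores, float)
    p = np.asarray(log10_params, float)
    hits = 0
    for i in range(len(s)):
        mask = np.arange(len(s)) != i
        slope, intercept = np.polyfit(s[mask], p[mask], 1)
        hits += abs(slope * s[i] + intercept - p[i]) <= np.log10(factor)
    return hits / len(s)

# Synthetic stand-in for the 89-model set (135M to 1.6T parameters); the
# noise level is tuned only so the output lands in a plausible range.
rng = np.random.default_rng(0)
true_logp = rng.uniform(np.log10(135e6), np.log10(1.6e12), size=89)
ikp_score = 0.25 * true_logp + rng.normal(0.0, 0.08, size=89)
print(f"within 3x: {loo_within_factor(ikp_score, true_logp):.1%}")
```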
The research challenges recent scaling pessimism: factual capacity continues to scale log-linearly with parameters across generations and vendors, showing no sign of saturation. Analysis of 96 dated open-weight models finds monthly knowledge decay statistically indistinguishable from zero, directly contradicting predictions of imminent scaling limits. Safety-tuned models score lower partly because refusal policies mask underlying knowledge: refusals can hide tens of percentage points of actual capacity, so estimates for safety-tuned models should be read as lower bounds on their true parameter counts.
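One way to read the zero-decay finding above is as a slope test: regress knowledge score on model age and check whether the slope differs from zero. Here is a sketch using SciPy's linregress on placeholder data generated with no true decay; the actual 96-model measurements are not in this summary.

```python
import numpy as np
from scipy.stats import linregress

# Placeholder data for 96 dated models: months since knowledge cutoff vs
# IKP score, drawn with zero underlying decay (the paper's finding).
rng = np.random.default_rng(1)
months = rng.uniform(0.0, 36.0, size=96)
score = 0.70 + rng.normal(0.0, 0.03, size=96)

fit = linregress(months, score)
print(f"decay: {fit.slope:+.5f} score/month, p = {fit.pvalue:.2f}")
# A p-value far above 0.05 fails to reject zero monthly decay.
```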
Editorial Opinion
This research fills a critical transparency gap in closed-source AI development. By grounding parameter estimates in information theory rather than inference costs, IKPs offer a reproducible, vendor-independent way to assess frontier model capabilities. The findings of continued log-linear scaling should shift the conversation away from premature "peak scaling" narratives: the real question isn't whether scaling works, but which architectural innovations are needed to push further.