Hierarchical Winner-Take-All Circuits Achieve Superior Compositional Reasoning with Dramatically Fewer Parameters
Key Takeaways
- HWTA circuits achieve compositional generalization with orders of magnitude fewer parameters than transformers—a 164-parameter model beats a 6.5M-parameter transformer by 94 points on SCAN
- The architecture replaces attention with discrete routing and slot-based message passing, in which slots carry source-slot state information, enabling multi-hop reasoning
- The content-structure separation principle—learning structure while passing content through unchanged—generalizes across task types (graphs, sequences, arithmetic) via task-specific instantiations
Summary
A new neural architecture called Hierarchical Winner-Take-All (HWTA) circuits has demonstrated remarkable performance on compositional reasoning tasks, beating transformer baselines by significant margins while using substantially fewer parameters. Most notably, a 164-parameter HWTA model achieved 100% accuracy on the SCAN add-jump benchmark, outperforming a 6.5M-parameter transformer by 93.6 points. The architecture replaces softmax attention with discrete routing over a fixed slot bank, enabling more efficient information flow and reasoning.
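The core routing idea can be illustrated with a minimal sketch. This is an assumed toy reconstruction, not the authors' implementation: each token scores a fixed bank of learned slot keys and is routed discretely to the single winning slot, in place of a continuous softmax mixture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_slots, dim = 4, 8, 16

tokens = rng.standard_normal((n_tokens, dim))    # input token states
slot_keys = rng.standard_normal((n_slots, dim))  # hypothetical learned slot bank

# Softmax attention would mix all slots with continuous weights;
# winner-take-all instead routes each token to exactly one slot.
scores = tokens @ slot_keys.T                    # (n_tokens, n_slots)
winners = scores.argmax(axis=1)                  # discrete routing decision

# Scatter each token's content into its winning slot unchanged:
# content passes through, only the routing structure is selected.
slots = np.zeros((n_slots, dim))
for t, s in enumerate(winners):
    slots[s] += tokens[t]
```

In a trained model the hard `argmax` would need a differentiable surrogate (e.g. a straight-through estimator), a detail this sketch omits.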
The key architectural innovation enables slots to communicate by carrying source-slot state information through message passing. This one-line fix lets information flow along the edges of the learned routing structure over multiple propagation steps, allowing genuine multi-hop reasoning to emerge. The principle of content-structure separation—structure is learned while content passes through unchanged—is instantiated differently per task type: edge-aware gather-scatter operations for graph tasks, positional slot buffers for sequence tasks, and bilinear operation tables for arithmetic tasks.
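A hypothetical sketch of the message-passing step described above (all names, shapes, and the ring topology are illustrative assumptions): each propagation step sends a slot's *state* along its routing edge, rather than a mere activation signal, so after k steps information has traveled k hops.

```python
import numpy as np

rng = np.random.default_rng(1)
n_slots, dim, n_steps = 6, 8, 3

state = rng.standard_normal((n_slots, dim))   # per-slot content
# Toy routing structure: each slot has one outgoing edge (a ring here);
# in the architecture these edges would be learned.
route = (np.arange(n_slots) + 1) % n_slots

for _ in range(n_steps):
    msgs = np.zeros_like(state)
    for src in range(n_slots):
        # The message carries the source slot's full state -- this is
        # what lets information propagate multiple hops, instead of
        # stalling after one edge traversal.
        msgs[route[src]] += state[src]
    state = state + msgs                      # content accumulates unchanged
```

With `n_steps = 3`, a value written into one slot has reached slots three edges away along the routing graph.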
HWTA circuits demonstrated consistent superiority across five compositional reasoning benchmarks: CLUTRR (+44.0 points at k=10 with matched parameters), SCAN variants (+78.7 to +93.6 points), CruxMini (+84.8 points), ListOps (+58.3 points), and Graph reasoning tasks (up to +9.2 points). The reproducible results span multiple seeds and include fully open-source implementations, from the minimal 164-parameter symbolic variant to fully-learned models with up to 1.5B parameters.
Taken together, the consistency across all five benchmarks and the full reproducibility of the results suggest that HWTA may be a fundamental alternative to attention for compositional tasks.
Editorial Opinion
This research challenges the near-universal assumption that attention is necessary for reasoning tasks and demonstrates that discrete routing over learned slot structures can achieve superior compositional generalization with dramatically fewer parameters. If these results hold up to scrutiny and generalize beyond the tested benchmarks, this could represent a significant shift in how we design architectures for reasoning. The emphasis on reproducibility, full code release, and honest reporting of ablations across multiple seeds sets a strong standard for ML research.