Nous Research Achieves 2-3x Faster LLM Pretraining with Token Superposition Training
Key Takeaways
- 2-3x wall-clock speedup at matched FLOPs without any changes to model architecture, optimizer, tokenizer, or training data
- Inference-time model is identical to conventionally trained models—efficiency gains are training-only
- Validated across scales from 270M to 10B-A1B MoE models, with improvements on standard benchmarks (MMLU, ARC, HellaSwag)
Summary
Nous Research introduces Token Superposition Training (TST), a novel modification to the standard LLM pretraining loop that achieves a 2-3x wall-clock speedup at matched FLOPs without changing the final model architecture, optimizer, tokenizer, or training data. The method works in two phases: during the first 20-40% of training, the model processes contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy loss on the output side. For the remainder of training, it reverts to standard next-token prediction.
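A minimal PyTorch sketch of what the input-side change during the superposition phase might look like, assuming a fixed bag size, simple mean-pooling of embeddings, and a single phase-switch fraction; the bag size, remainder handling, and switch point here are illustrative assumptions, not details confirmed by the paper:

```python
import torch
import torch.nn as nn

def bag_tokens(input_ids: torch.Tensor, bag_size: int) -> torch.Tensor:
    """Reshape (batch, seq_len) token ids into contiguous bags of size `bag_size`.

    Hypothetical helper: how trailing tokens that don't fill a bag are handled
    is unknown, so they are simply dropped in this sketch.
    """
    batch, seq_len = input_ids.shape
    usable = (seq_len // bag_size) * bag_size
    return input_ids[:, :usable].reshape(batch, -1, bag_size)

def superposed_inputs(input_ids: torch.Tensor,
                      embedding: nn.Embedding,
                      bag_size: int) -> torch.Tensor:
    """Average each bag's token embeddings, shortening the sequence by a factor of `bag_size`."""
    bags = bag_tokens(input_ids, bag_size)    # (batch, n_bags, bag_size)
    emb = embedding(bags)                     # (batch, n_bags, bag_size, d_model)
    return emb.mean(dim=2)                    # (batch, n_bags, d_model)

def in_superposition_phase(step: int, total_steps: int, frac: float = 0.3) -> bool:
    """Phase schedule: bagged training for roughly the first 20-40% of steps, then standard."""
    return step < frac * total_steps
```

During the superposition phase the transformer would consume these shortened sequences, which is where the wall-clock savings would come from; after the switch, the loop falls back to ordinary per-token embeddings and next-token prediction.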
The research validates TST across multiple scales—270M, 600M, and 3B dense models, as well as a 10B-A1B mixture-of-experts model trained to 2T tokens. On the 10B-A1B MoE, TST reached lower final training loss than a matched-FLOPs baseline in roughly 40% of wall-clock time and outperformed the baseline on standard benchmarks including HellaSwag, ARC-Easy, ARC-Challenge, and MMLU. Critically, the inference-time model is identical in every respect to one produced by conventional pretraining—only the training loop changes.
The technique is motivated by two key insights: that the efficiency of subword tokenization comes largely from processing shorter sequences rather than from learned semantics, and that training-time and inference-time efficiency can be decoupled. This challenges the coupling between training optimizations and model architecture changes seen in most prior efficiency improvements.
- Decouples training-time efficiency from inference-time constraints, a separation most prior efficiency methods do not make
- Achieved through a conceptually simple technique: reshaping tokens into bags, averaging their embeddings, and applying a modified cross-entropy loss (one possible form is sketched below)
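As a rough illustration of the output side, one plausible reading of the "modified cross-entropy" is to score a single predicted distribution per position against every token in the next bag and average the per-token losses; the paper's exact loss may differ, so treat this as an assumption:

```python
import torch
import torch.nn.functional as F

def bag_cross_entropy(logits: torch.Tensor, target_bags: torch.Tensor) -> torch.Tensor:
    """One plausible next-bag loss (an assumption, not the paper's exact definition).

    logits:      (batch, n_bags, vocab)     -- one distribution per input bag
    target_bags: (batch, n_bags, bag_size)  -- token ids of the following bag
    """
    batch, n_bags, vocab = logits.shape
    bag_size = target_bags.shape[-1]
    # Reuse the same predicted distribution for every token in the target bag,
    # then average standard cross-entropy over all bag positions.
    expanded = logits.unsqueeze(2).expand(batch, n_bags, bag_size, vocab)
    return F.cross_entropy(expanded.reshape(-1, vocab), target_bags.reshape(-1))
```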
Editorial Opinion
Token Superposition Training represents an elegant and practically important breakthrough in LLM efficiency. By decoupling training-time optimization from inference-time architecture, a separation most efficiency methods fail to make, TST opens a path to faster model development without the usual trade-offs. The fact that such substantial speedups come from a relatively simple technique (reshaping tokens into bags, averaging embeddings, and a modified loss) suggests there may be significant untapped efficiency gains in how the pretraining process itself is structured. This work has immediate implications for reducing the computational cost of developing and scaling new models.


