Fine-Tuned 14B Open-Source Model Outperforms GPT-4o at NYT Connections Puzzle
Key Takeaways
- A fine-tuned 14B-parameter open-source model (Qwen 2.5) achieved a 30% solve rate on NYT Connections puzzles, beating GPT-4o's 22.7%, through knowledge distillation from Claude Sonnet 4.5
- The breakthrough came from training on reasoning traces rather than bare puzzle solutions, capturing the step-by-step thought process of a more capable model
- NYT Connections is particularly challenging for AI because it demands cultural knowledge, an understanding of wordplay, and resistance to intentional misdirection
Summary
An independent AI researcher has fine-tuned Qwen 2.5 14B, a 14-billion-parameter open-source model, to a 30% solve rate on New York Times Connections puzzles, surpassing GPT-4o's 22.7%. Starting from a baseline of just 9.3%, the researcher experimented with multiple approaches, including one-shot prompting, direct fine-tuning, and synthetic data generation, before settling on knowledge distillation from Claude Sonnet 4.5: capturing the larger model's chain-of-thought reasoning and iteratively refining the training data.
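The core of the distillation strategy is packaging each puzzle together with the teacher model's reasoning trace as a supervised training example. A minimal sketch of that packaging step is below; the record fields, prompt wording, and function name are illustrative assumptions, not the researcher's actual schema.

```python
# Sketch: turning teacher reasoning traces into fine-tuning records.
# Field names and the prompt template are assumptions for illustration;
# the project's actual schema is not specified in the article.

def make_training_record(puzzle_words, reasoning_trace, solution_groups):
    """Wrap one puzzle, the teacher's chain-of-thought, and the final
    grouping into a single instruction-tuning example."""
    prompt = (
        "Solve this NYT Connections puzzle. Group the 16 words into "
        "4 groups of 4 related words.\n\nWords: " + ", ".join(puzzle_words)
    )
    # The student is trained to emit the reasoning first, then the
    # answer, so it imitates the teacher's step-by-step process rather
    # than memorizing bare solutions.
    completion = reasoning_trace + "\n\nAnswer:\n" + "\n".join(
        ", ".join(group) for group in solution_groups
    )
    return {"prompt": prompt, "completion": completion}
```

Putting the trace before the answer in the completion is the key design choice: the loss then rewards reproducing the reasoning process, not just the four groups.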
The NYT Connections puzzle presents unique challenges for AI systems, requiring not just pattern recognition but cultural knowledge, wordplay understanding, and the ability to resist obvious-but-wrong groupings. Unlike games like Wordle that can be solved algorithmically, Connections demands reasoning about thematic relationships, linguistic patterns, and intentional misdirection. The researcher used 913 historical puzzles from a Kaggle dataset, training on 763 older puzzles and testing on the most recent 150 to prevent memorization.
The successful approach used Claude Sonnet 4.5 to generate detailed reasoning traces for each puzzle, then fine-tuned the smaller Qwen model to replicate that reasoning process. The entire training run cost approximately $10 and took just 20 minutes per model iteration on RunPod A100 GPUs with Unsloth's optimized QLoRA framework. The result demonstrates that strategic fine-tuning on high-quality reasoning data can let smaller, open-source models exceed much larger commercial systems on specialized tasks, with no per-call API costs once the model is deployed.
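A QLoRA setup of the kind described above might look like the following configuration sketch using Unsloth. The checkpoint name and all hyperparameters are assumptions for illustration; the article does not publish the researcher's exact settings, and this is not the actual training script.

```python
# Configuration sketch of a QLoRA fine-tune with Unsloth.
# Checkpoint name and hyperparameters are illustrative assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-14B-Instruct",  # assumed checkpoint
    max_seq_length=4096,
    load_in_4bit=True,  # QLoRA: 4-bit quantized base weights
)

# Attach low-rank adapters; only these small matrices are trained,
# which is what keeps the run cheap and fast on a single A100.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

Training itself would then typically run through a standard supervised fine-tuning loop (for example, TRL's `SFTTrainer`) over the distilled reasoning traces.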
Editorial Opinion
This work exemplifies an important trend in AI development: specialized, fine-tuned smaller models beating larger general-purpose systems at specific tasks. The 30% vs 22.7% result is particularly striking given Qwen 2.5 14B's significantly smaller size compared to GPT-4o. However, it's worth noting that a 30% solve rate still means the model fails 70% of the time, highlighting just how difficult these reasoning-heavy language puzzles remain for current AI systems. The success of distillation approaches suggests that the reasoning capabilities of frontier models can be efficiently compressed into smaller, more deployable systems—a promising direction for making advanced AI more accessible and cost-effective.