BotBeat

Anthropic | RESEARCH | 2026-05-26

Harnesses and Post-training Close the Open-Weight Bug-Finding Gap

Key Takeaways

  • GLM-5.1 achieves near parity with Claude Opus on bug-finding benchmarks, while most other open-weight models (DeepSeek V4 pro, Qwen3.5-397B, Kimi K2.6) perform significantly worse
  • A well-designed harness can dramatically improve open-weight model performance on vulnerability detection, closing most of the capability gap with Opus
  • Post-training and fine-tuning matter more than base architecture for bug-finding and security task performance
Source: Hacker News, https://vincenzoiozzo.com/blog/oss-models-vuln-research

Summary

A new research analysis finds that most open-weight large language models significantly underperform Anthropic's Claude Opus on vulnerability detection tasks. The notable exception is GLM-5.1, which achieves near parity with Opus. The study benchmarks DeepSeek V4 pro, Qwen3.5-397B-A17B, Kimi K2.6, GLM-5, and GLM-5.1 against Opus 4.7 using the crackaddr vulnerability, and shows a substantial capability gap for most open-weight models on harder artifacts.

Crucially, the research demonstrates that a well-designed harness (the scaffolding and workflow that guides a model through systematic analysis) can substantially narrow the performance gap between open-weight and closed models. When run with such a harness within Claude Code, most open-weight models show significant improvement in bug-finding capabilities, though all except GLM-5.1 still lag behind Opus. The analysis also finds that post-training and fine-tuning matter far more than the underlying base architecture in determining security task performance.
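
To make the idea of a harness concrete, here is a minimal Python sketch of a scaffold that walks a model through a fixed sequence of analysis steps and counts how many calls it takes to report a bug. It is an illustration only, not the harness used in the study: the step prompts, the `run_harness` and `stub_model` names, the `BUG:` reply convention, and the `max_rounds` parameter are all assumptions made for this example, and the model is a stub rather than a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical interface: a "model" is any function from prompt text to reply text.
ModelFn = Callable[[str], str]

# Hypothetical analysis steps; the study's actual prompts are not given in this article.
STEPS = [
    "List every buffer and the code paths that write to it.",
    "For each write, state the bounds check that guards it.",
    "Report any write with no effective bounds check, prefixed with 'BUG:'.",
]


@dataclass
class HarnessResult:
    finding: Optional[str]          # first reply that starts with 'BUG:', if any
    iterations: int                 # number of model calls made
    transcript: List[str] = field(default_factory=list)


def run_harness(model: ModelFn, source: str, max_rounds: int = 5) -> HarnessResult:
    """Drive the model through STEPS, repeating the sequence up to max_rounds times."""
    transcript: List[str] = []
    iterations = 0
    for _ in range(max_rounds):
        notes = ""  # notes are carried across steps within a round, then reset
        for step in STEPS:
            iterations += 1
            prompt = f"{step}\n\nNotes so far:\n{notes}\n\nCode:\n{source}"
            reply = model(prompt)
            transcript.append(reply)
            notes += reply + "\n"
            if reply.startswith("BUG:"):
                return HarnessResult(reply, iterations, transcript)
    return HarnessResult(None, iterations, transcript)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model: only 'finds' a bug on the reporting step."""
    if prompt.startswith("Report"):
        return "BUG: unchecked write"
    return "noted"


if __name__ == "__main__":
    result = run_harness(stub_model, "void f(void) {}")
    print(result.finding, result.iterations)
```

The iteration counter in this sketch corresponds loosely to the article's observation that open-weight models take more iterations to find bugs; a real evaluation would replace `stub_model` with calls to the model under test.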

The research has immediate implications for export control debates around open-weight models and offensive cyber capabilities. While open-weight models can be run locally and potentially augmented through fine-tuning, the study suggests that architectural choices, post-training methodologies, and harness design significantly impact offensive potential. Open-weight models take more iterations to find bugs and struggle to recognize pattern variations, suggesting they lack the specialized security training depth found in Opus.

  • Open-weight models require more iterations and struggle with pattern recognition on variant crackaddr cases compared to Opus
  • The findings inform export control discussions by showing that capability gaps are engineering choices, not inevitable architectural limitations

Editorial Opinion

This research provides empirical evidence for ongoing export control debates by demonstrating that open-weight model threat levels depend heavily on engineering choices rather than being architecturally predetermined. The standout performance of GLM-5.1 shows the gap is not inevitable, yet the harness results shouldn't be misread as fully closing the security gap, since sophisticated actors could augment these models further. The real policy insight may be that post-training transparency and fine-tuning oversight are as important as controlling model weight distribution. For defenders and policymakers, this suggests that the distinction between 'open' and 'closed' matters less than the overall capability trajectory and access to optimization techniques.

Large Language Models (LLMs) | Generative AI | Cybersecurity | AI Safety & Alignment
