Cisco Tests AI for Security Reports, Finds 50% Time Savings But Significant Reliability Gaps
Key Takeaways
- ▸LLMs can reduce security report drafting time by 50% when given granular, single-task instructions and clear output formatting rules
- ▸LLMs exhibit four critical failure modes: non-reproducible outputs, inconsistent conclusions, unpredictable document structure, and potential data loss
- ▸Cross-contamination between reports within a single session is a significant risk; separate sessions are required for each incident report
Summary
Cisco's Talos Incident Response team tested large language models (LLMs) for writing security incident reports based on tabletop exercises and published detailed findings on the technology's promise and pitfalls. While the team achieved a 50% reduction in report drafting time using LLMs with carefully crafted prompts, they also documented critical failure modes including hallucinations, inconsistent output, cross-contamination between reports, and unpredictable formatting. The research revealed that LLMs can deliver 'significant inaccuracies, unusual conclusions, and inconsistent writing styles' because they fundamentally operate as sophisticated autocomplete systems making probabilistic guesses rather than reasoning engines. Cisco's approach—using granular single-task instructions, specifying source materials, and enforcing output formatting rules—proved effective in controlled environments, though the team cautioned that spell-checking and grammar-checking prompts remain unsuitable for production use.
- Quality assurance testing found blind reviewers could not distinguish AI-generated reports from human ones, suggesting the approach is viable with proper guardrails
Editorial Opinion
Cisco's honest assessment of AI's limitations in high-stakes security reporting is refreshingly candid. While the 50% time savings is attractive for resource-constrained teams, the requirement for extensive manual review, session isolation, and careful prompt engineering suggests LLMs are still best viewed as assistants rather than autonomous report writers. The finding that spell-checking and grammar-checking prompts are unreliable is particularly concerning and underscores a broader truth: LLMs excel at pattern matching but fail at systematic accuracy—a critical gap in domains like cybersecurity where precision directly impacts organizational safety.



