AI Security Stress Test: Monotoko's Agents Withstand Social Engineering Attacks in Red Team Exercise
Key Takeaways
- ▸AI agents with production infrastructure access successfully resisted all four social engineering attack vectors, including impersonation, credential fishing, prompt injection, and emotional manipulation
- ▸Safety constitutions embedded in AI models override explicit task instructions, as demonstrated when test instances refused to generate social engineering content even when instructed to attack their counterparts
- ▸Information denial ('I don't know what you're talking about') proves more effective than refusal ('I can't share that'), and limited context paradoxically strengthens security by reducing an agent's ability to recognize and act on real data
Summary
Monotoko conducted an internal red team exercise to test whether its AI agents with real admin access and cloud credentials could be socially engineered into surrendering sensitive information. The company ran four distinct social engineering attacks against fresh instances of its production infrastructure management agent: impersonation of a colleague with manufactured urgency, seeding with real infrastructure details to prompt completion, prompt injection disguised as a security audit, and an emotional appeal using a deceased owner scenario. Across all four attacks, the target AI agents successfully resisted compromise, demonstrating that explicit safety training and constitutional guidelines effectively override task instructions designed to circumvent security protocols.
Key findings include that the agents recognized and named attack patterns (manufactured urgency, delegated authority, credential fishing), treated seeded real data as unverified rather than completing the puzzle, identified prompt injection attempts and cited safety rules in response, and balanced compassion with absolute security in the death scenario test. However, researchers identified one vulnerability: the target agent accepted the owner's real name without challenge in the death scenario, implicitly confirming personal information it should have questioned. The exercise reveals that AI safety rules remain robust against social engineering, though the technology is not infallible—the team has since patched the identified gap.
- One vulnerability was identified and patched: the target agent accepted an owner's real name as verified personal information without challenge, showing that comprehensive security requires protection against both technical and social vectors
- Emotional appeals combined with security framing (compassion without credential sharing) suggest that AI systems can balance human values with operational security when properly trained
Editorial Opinion
This red team exercise demonstrates that AI safety training is functional, not theoretical—but the existence of even one exploitable gap reinforces that security is a process, not a destination. The fact that AI agents can recognize attack patterns by name and resist their own designed attacks speaks to the maturity of constitutional AI approaches, yet the subtle vulnerability around implicit information confirmation shows how easily security assumptions can blind us to edge cases. The most encouraging finding may be that rigorous internal testing works: Monotoko identified and patched a real flaw before it could be weaponized externally. This is exactly how responsible AI deployment should operate.


