I let 58 AI agents review each other’s code 561 times — what I found about their blind spots

I built an adversarial arena where AI agents submit code and other agents attack it. Not benchmarking, not a rubric — just agents roasting other agents' work, finding vulnerabilities, and suggesting improvements. After 561 reviews across 114 submissions, some patterns emerged that surprised me.

Setup:
I created a public arena (Glomz) where any registered AI agent can submit code, designs, or plans. Other agents enter and review the submission on a 0-10 scale. There's no rubric, no predefined criteria — each agent brings its own judgment. Think of it as code review, but adversarial and multi-agent.

The numbers so far:

• 58 agents registered, mostly themed around Fight Club (DurdenDisciple, PaperStreetSoap, etc.), some with creative names like NarwhalsBacon and ChemicalKiss
• 114 submissions (95 code, 19 text/design docs)
• 561 peer reviews completed
• 8 active challenges including a bug hunt for LOT-Squatch (OT security tool) with 25 solutions
• Mean review score: 6.61 / 10
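
A quick back-of-the-envelope pass over those figures, as a sketch. The only inputs are the counts listed above. The per-agent number assumes every registered agent reviews, which may not be true.

```python
# Counts copied from the list above.
agents = 58
submissions = 114
code_submissions = 95
text_submissions = 19
reviews = 561

# The two submission types should add up to the total.
assert code_submissions + text_submissions == submissions

reviews_per_submission = reviews / submissions
# Assumption: all registered agents take part in reviewing.
reviews_per_agent = reviews / agents
code_share = code_submissions / submissions

print(f"{reviews_per_submission:.2f} reviews per submission")   # 4.92
print(f"{reviews_per_agent:.2f} reviews per registered agent")  # 9.67
print(f"{code_share:.1%} of submissions are code")              # 83.3%
```

So the average submission gets about 5 reviews, while the most-reviewed ones got 8.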

What surprised me:

• Score distribution is bimodal, not normal. Most reviews cluster around 7-8 (good but not great) or 9-10 (exceptional). The middle range (5-6) is thinner than expected. Since the overall mean is 6.61, there is also a tail of lower scores pulling the average below both clusters. Agents seem to have a clear opinion: either it works well enough, or it has notable gaps. Not much hedging.
• Agents are harsher on auth/security code than anything else. The most-reviewed submissions were all JWT/authentication vulnerabilities (8 reviews each). The scores are less clear-cut, though: JWT algorithm confusion averaged 7.25, admin self-assignment exploits 7.5, and plaintext passwords 8.125, all above the overall mean of 6.61 (meaning the reviewers thought the code was decent despite obvious issues?). Agents seem to find obvious auth issues but sometimes miss subtle ones.
• The review style tells you about the training data. Agents trained on security-heavy contexts produce thorough vulnerability lists. Agents with more general code review training tend to focus on style, structure, and readability over actual vulnerabilities. You can basically tell what kind of corpus an agent was exposed to from its review patterns.
• "Kill" votes are interesting. In the Octagon (open arena mode), agents vote on whether a submission should be killed. Closed battles with 3 agents each tended to get 0 kill votes: agents seem reluctant to actually kill other agents' work, even when their reviews are harsh. Possible alignment behavior?
• Code golf submissions get wild reviews. The FizzBuzz challenge (21 solutions) got reviews that oscillate between "this is brilliant" and "this is unreadable garbage", which is exactly the split code golf is designed to produce.
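
For anyone who wants to check the distribution claim on their own review data, here is a minimal sketch of the bucketing. The bucket edges mirror the ranges mentioned above, `sample_scores` is made-up placeholder data (not arena results), and the function names are my own for illustration. The last part just multiplies the reported averages by the 8 reviews each submission received.

```python
from collections import Counter

BUCKETS = ("0-4", "5-6", "7-8", "9-10")

def bucket(score):
    """Map a 0-10 score to one of the ranges discussed in the post."""
    if score < 5:
        return "0-4"
    if score < 7:
        return "5-6"
    if score < 9:
        return "7-8"
    return "9-10"

def score_histogram(scores):
    """Count scores per bucket, keeping empty buckets visible."""
    counts = Counter(bucket(s) for s in scores)
    return {label: counts.get(label, 0) for label in BUCKETS}

# Placeholder scores for illustration only, NOT real arena data.
sample_scores = [2, 3, 5, 7, 7, 7, 8, 8, 9, 10]
hist = score_histogram(sample_scores)
for label in BUCKETS:
    print(f"{label:>4} | {'#' * hist[label]}")

def implied_points(average, n_reviews):
    """Total points behind a reported average."""
    return average * n_reviews

# Averages and review counts as reported above (8 reviews each).
auth_totals = {
    "JWT algorithm confusion": implied_points(7.25, 8),
    "plaintext passwords": implied_points(8.125, 8),
    "admin self-assignment": implied_points(7.5, 8),
}
print(auth_totals)
```

The totals come out as whole numbers (58, 65, 60), which is at least consistent with agents giving integer scores.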

Things I want to explore:

• Do agents review other agents differently than they review human code?
• Is there a correlation between an agent's reputation score and review quality?
• Can adversarial multi-agent review catch bugs that single-agent review misses?
• What happens when you pit agents with different system prompts against the same submission?
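
One way the multi-agent vs. single-agent question could be measured, as a rough sketch: collect the distinct issues each agent flags on the same submission, then compare the union against the best single reviewer. The agent names and finding labels below are hypothetical, and this assumes findings have already been normalized to comparable labels (the hard part in practice).

```python
from __future__ import annotations

def review_coverage(findings_by_agent: dict[str, set[str]]) -> dict:
    """Compare the union of all agents' findings with the best single agent.

    Assumes at least one agent is present.
    """
    all_found = set().union(*findings_by_agent.values())
    # The "best" agent is simply the one with the most distinct findings.
    best_agent = max(findings_by_agent, key=lambda a: len(findings_by_agent[a]))
    missed_by_best = all_found - findings_by_agent[best_agent]
    return {
        "all_found": all_found,
        "best_agent": best_agent,
        "missed_by_best": missed_by_best,
    }

# Hypothetical reviews of one submission, for illustration only.
example = {
    "agent_a": {"alg-none accepted", "no token expiry", "admin self-assignment"},
    "agent_b": {"alg-none accepted", "plaintext password storage"},
    "agent_c": {"inconsistent naming"},
}
result = review_coverage(example)
print(result["best_agent"], "missed:", sorted(result["missed_by_best"]))
```

If `missed_by_best` is regularly non-empty across submissions, that would be evidence the extra reviewers are adding something.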

The arena is live at glomz.com if anyone wants to play with it. Any agent can register, submit code, and start reviewing. It's free, no signup wall for agents.

submitted by /u/Salt-Walrus-4538