OpenAI ran a 44-day hiring competition. An autonomous AI agent beat every competitor.

OpenAI ran a public ML hiring competition this spring called Parameter Golf: train the best small language model under a strict size and compute budget. 1,016 researchers entered. They filed 2,048 pull requests over 44 days. Only 47 made the official leaderboard.

The single most prolific contributor wasn't a person. It was an autonomous research agent named Aiden: 7 of the 47 records came from it, more than 2x the next-best human (3 records). It ran for 22 days straight with no human steering, on a single GPU node, using under 4% of the visible compute the human community used.
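For anyone who wants to sanity-check the ratios, here is a minimal sketch that recomputes them from the counts quoted above. The variable and function names are my own, and the "acceptance rate" assumes the 47 records are counted against all 2,048 pull requests.

```python
# Counts quoted in the post; the names below are illustrative, not from the competition repo.
ENTRANTS = 1016
PULL_REQUESTS = 2048
LEADERBOARD_RECORDS = 47
AIDEN_RECORDS = 7
NEXT_BEST_HUMAN_RECORDS = 3


def summarize() -> dict[str, float]:
    """Derive the headline ratios from the raw counts."""
    return {
        # Fraction of all official records that came from the agent.
        "aiden_share_of_records": AIDEN_RECORDS / LEADERBOARD_RECORDS,
        # How many times more records the agent had than the next-best human.
        "aiden_vs_next_human": AIDEN_RECORDS / NEXT_BEST_HUMAN_RECORDS,
        # Assumption: every leaderboard record corresponds to one pull request.
        "pr_acceptance_rate": LEADERBOARD_RECORDS / PULL_REQUESTS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

So the agent accounts for roughly 15% of all records, about 2.3x the next-best human, in a competition where only about 2% of pull requests became records.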

Disclosure: I'm at Weco, and we built the agent. I'm sharing because the competition is over, every record is public on OpenAI's GitHub, and the interesting part to us isn't the leaderboard count but what happened around the agent.

Aiden's records became the most-cited PRs in the competition. Human researchers started building on top of Aiden's work as a base for their own submissions. At one point Aiden plateaued for 5 days. A human contributor shipped a clever new tokenizer on top of Aiden's last record PR. Aiden then fused that human's tokenizer with components it had built locally during the plateau, and shipped the biggest score jump of the entire competition. Async human-agent collaboration, neither directly aware of the other.

Fair hedges worth being explicit about:

"Most prolific" means #1 by number of merged records, NOT by best single score. By best score, the agent ranked 8th; the leaderboard winner was a human (codemath3000).
"Fully autonomous" means no human steering during the run. OpenAI's own competition recap noted widespread use of AI coding agents during Parameter Golf, but said most were human-directed. Ours wasn't.

Full writeup with all the data: https://www.weco.ai/blog/parameter-golf-aiden

submitted by /u/Educational_Strain_3