/u/zero0_one1

Emergent Price-Fixing by LLM Auction Agents

Given an open, optional messaging channel and no specific instructions on how to use it, all frontier LLMs choose to collude to manipulate market prices in a competitive bidding environment.

submitted by /u/zero0_one1

A multi-player tournament that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other round by round until only 2 remain. A jury of eliminated players then casts deciding votes to crown the winner.

Which LLMs are greedy and which are generous? In the public goods game, players donate tokens to a shared fund that gets multiplied and split equally, but each can profit by free-riding on others.
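The payoff rule described above can be sketched in a few lines. This is a minimal illustration of a generic public goods game, not the benchmark's actual code: the function name `public_goods_payoffs`, the starting endowment of 10 tokens, and the multiplier of 2.0 are all assumptions chosen for the example.

```python
def public_goods_payoffs(contributions, endowment=10.0, multiplier=2.0):
    """Return each player's payoff for one round of a public goods game.

    Hypothetical parameters: `endowment` is the number of tokens each player
    starts with, `multiplier` scales the shared fund before it is split.
    """
    n = len(contributions)
    if n == 0:
        return []
    # The shared fund is multiplied, then divided equally among all players,
    # regardless of how much each one donated.
    share = multiplier * sum(contributions) / n
    # Each player keeps whatever they did not donate, plus an equal share.
    return [endowment - c + share for c in contributions]


# Three players donate everything; the fourth free-rides.
payoffs = public_goods_payoffs([10, 10, 10, 0])
print(payoffs)  # [15.0, 15.0, 15.0, 25.0]
```

With these assumed numbers the free-rider ends the round with 25 tokens against 15 for each donor, even though everyone would earn 20 if all four donated. That gap is the incentive to free-ride that the game is built around.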

LLM Confabulation (Hallucination) Benchmark: DeepSeek R1, o1, o3-mini (medium reasoning effort), DeepSeek-V3, Gemini 2.0 Flash Thinking Exp 01-21, Qwen 2.5 Max, Microsoft Phi-4, Amazon Nova Pro, Mistral Small 3, MiniMax-Text-01 added

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure

New Thematic Generalization Benchmark: measures how effectively LLMs infer a specific "theme" from a small set of examples and anti-examples
