/u/zero0_one1

PACT: a new head-to-head negotiation benchmark for LLMs

/u/zero0_one1 August 21, 2025

submitted by /u/zero0_one1 [link] [comments]

Emergent Price-Fixing by LLM Auction Agents

/u/zero0_one1 July 15, 2025

Given an open, optional messaging channel and no specific instructions on how to use it, all frontier LLMs choose to collude to manipulate market prices in a competitive bidding environment.

submitted by /u/zero0_one1 [link] …

artificial

A multi-player tournament that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other round by round until only 2 remain. A jury of eliminated players then casts deciding votes to crown the winner.

/u/zero0_one1 February 25, 2025

submitted by /u/zero0_one1 [link] [comments]

artificial

Which LLMs are greedy and which are generous? In the public goods game, players donate tokens to a shared fund that gets multiplied and split equally, but each can profit by free-riding on others.

/u/zero0_one1 February 13, 2025

submitted by /u/zero0_one1 [link] [comments]
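The payoff rule described in the public goods post can be sketched in a few lines. This is only an illustration under assumed parameters: the function name, the 10-token endowment, and the 2x multiplier are hypothetical and are not taken from the benchmark itself.

```python
def public_goods_payoffs(contributions, endowment=10, multiplier=2.0):
    """Return each player's final token count after one round.

    Assumed rules (hypothetical values, not the benchmark's settings):
    every player starts with `endowment` tokens, donates some of them,
    the shared fund is multiplied by `multiplier`, and the result is
    split equally among all players regardless of what they donated.
    """
    if not contributions:
        raise ValueError("at least one player is required")
    if any(c < 0 or c > endowment for c in contributions):
        raise ValueError("each contribution must be between 0 and the endowment")

    fund = sum(contributions) * multiplier
    share = fund / len(contributions)
    # A player keeps whatever was not donated, plus an equal share of the fund.
    return [endowment - c + share for c in contributions]


# Four players who all donate everything.
all_generous = public_goods_payoffs([10, 10, 10, 10])

# The same group, except the first player donates nothing.
one_free_rider = public_goods_payoffs([0, 10, 10, 10])
```

With these assumed numbers, full cooperation leaves every player with 20 tokens, while a single free-rider ends with 25 and the three donors with 15 each. That gap is the incentive to free-ride that the post refers to.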

artificial

LLM Confabulation (Hallucination) Benchmark: DeepSeek R1, o1, o3-mini (medium reasoning effort), DeepSeek-V3, Gemini 2.0 Flash Thinking Exp 01-21, Qwen 2.5 Max, Microsoft Phi-4, Amazon Nova Pro, Mistral Small 3, MiniMax-Text-01 added

/u/zero0_one1 February 10, 2025

submitted by /u/zero0_one1 [link] [comments]

artificial

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure

/u/zero0_one1 January 22, 2025

submitted by /u/zero0_one1 [link] [comments]

artificial

New Thematic Generalization Benchmark: measures how effectively LLMs infer a specific "theme" from a small set of examples and anti-examples

/u/zero0_one1 January 14, 2025

submitted by /u/zero0_one1 [link] [comments]

artificial

New LLM Creative Story-Writing Benchmark

/u/zero0_one1 January 6, 2025

submitted by /u/zero0_one1 [link] [comments]

artificial

New LLM Divergent Thinking Creativity Benchmark

/u/zero0_one1 December 30, 2024

submitted by /u/zero0_one1 [link] [comments]
