Shifting order in multiple-choice questions massively affects LLM performance

Recent research suggests that Large Language Models (LLMs) may not be as reliable as we think: the order of options in a multiple-choice question drastically influences the responses of LLMs such as GPT-4 and InstructGPT.

What are the findings?

  • LLM sensitivity to option order: reordering the options of multiple-choice questions changes performance dramatically, with gaps of roughly 13% to 75% depending on the benchmark.
  • Positional bias shapes responses: when the model is uncertain among its top candidate answers, it tends to favor specific positions, so where an option is placed can artificially tilt its prediction.
  • Calibration helps: applying two calibration methods improved LLM performance by up to eight percentage points across multiple models and benchmarks.
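The order-sensitivity finding above can be probed with a simple experiment: permute a question's options, query the model under each ordering, and measure how often its chosen *content* (not its chosen letter) changes. A minimal sketch, where `query_model` is a hypothetical stand-in for an actual LLM call (nothing here is the paper's own code):

```python
import itertools

def permute_options(options):
    """Yield every ordering of the options, each as a dict mapping
    the label ('A', 'B', ...) to the option placed at that position."""
    for perm in itertools.permutations(options):
        labels = [chr(ord("A") + i) for i in range(len(perm))]
        yield dict(zip(labels, perm))

def order_sensitivity(question, options, query_model):
    """Fraction of non-baseline orderings in which the model's chosen
    option content differs from its choice under the first ordering.

    query_model(question, labeled_options) -> label (e.g. 'A')
    is a hypothetical stand-in for a real LLM call.
    """
    answers = []
    for labeled in permute_options(options):
        label = query_model(question, labeled)
        answers.append(labeled[label])  # map the letter back to content
    baseline = answers[0]
    flips = sum(a != baseline for a in answers[1:])
    return flips / max(len(answers) - 1, 1)
```

A model that always answers "A" regardless of content (pure positional bias) scores high on this metric, while a model that tracks the same option content across orderings scores 0.0.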

Why does this matter?

This work moves us closer to pinpointing the factors behind LLMs' order sensitivity and underscores the need to recognize and mitigate these biases to improve real-world usability and reliability.

(arXiv)

submitted by /u/AIsupercharged