Beware of Unreliable Data in Model Evaluation: Case Study w/ Flan T5 LLM

Hello Redditors!

LLMs have established themselves at the forefront of natural language processing, consistently pushing the limits of language comprehension and generation.

It's pretty common now for data scientists and ML engineers to validate the quality of the training data they feed into these LLMs, but what about the test data used to evaluate them?

I spent some time playing around with FLAN-T5, an open-source LLM from Google Research, and discovered that noisy test/evaluation data can actually cause you to choose sub-optimal prompts.

[Figure: Using the observed accuracy, you would select Prompt A as better. Prompt B is actually the better prompt when evaluated on the clean test set.]
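
To make the setup concrete, here's a minimal sketch of the kind of prompt-evaluation loop involved, using FLAN-T5 through Hugging Face transformers (the model size, prompt templates, and examples below are illustrative placeholders, not my exact setup):

```python
# Minimal sketch: score a prompt template by exact-match accuracy on a test set.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def accuracy(prompt_template, examples):
    """Fraction of (text, label) examples where the model's output matches the label."""
    correct = 0
    for text, label in examples:
        inputs = tokenizer(prompt_template.format(text=text), return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=10)
        prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        correct += int(prediction.strip().lower() == label.lower())
    return correct / len(examples)

# Placeholder test examples and two candidate prompt templates
examples = [("I loved this movie!", "positive"), ("Total waste of time.", "negative")]
prompt_a = "Is the sentiment of this review positive or negative? {text}"
prompt_b = "Review: {text}\nSentiment (positive or negative):"
print(accuracy(prompt_a, examples), accuracy(prompt_b, examples))
```

Swapping different templates into this loop is all "prompt selection" amounts to here, which is why noisy labels in the test set can quietly flip the ranking.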

Given two prompts A and B, I found multiple cases where prompt A performed better on the observed (noisy) test data, yet worse on the high-quality (clean) test data. In practice, this means you would choose A as the "best prompt" when prompt B is actually the better one. I also verified via McNemar's test that the accuracy differences are statistically significant.
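
For the significance check, here's a minimal sketch of McNemar's test on paired per-example correctness for the two prompts (statsmodels provides the test; the 0/1 arrays below are made-up placeholders, not my actual results):

```python
# Sketch: McNemar's test comparing two prompts on the same test examples.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# 1 = prompt answered that test example correctly, 0 = incorrect (placeholders)
correct_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
correct_b = np.array([1, 0, 1, 1, 1, 1, 0, 1, 1, 1])

# 2x2 contingency table: rows = prompt A correct/incorrect,
# columns = prompt B correct/incorrect, over the same examples
table = np.array([
    [np.sum((correct_a == 1) & (correct_b == 1)), np.sum((correct_a == 1) & (correct_b == 0))],
    [np.sum((correct_a == 0) & (correct_b == 1)), np.sum((correct_a == 0) & (correct_b == 0))],
])

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue}")
```

McNemar's test only looks at the discordant pairs (examples where exactly one prompt is correct), which is what makes it a natural fit for comparing two prompts evaluated on the same test set.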

I wrote up a quick article that explains my methodology and how I used data-centric AI to automatically clean the noisy test data and ensure reliable prompt selection.
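
The article has the details, but as a rough illustration of what automated test-set cleaning can look like, here's confident-learning-style label-issue detection with the open-source cleanlab package (a sketch, not necessarily my exact pipeline; the labels and predicted probabilities below are tiny placeholders, and in practice pred_probs would come from out-of-sample model predictions):

```python
# Sketch: flag likely-mislabeled test examples with cleanlab.
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 1, 1, 0, 1])  # given (possibly noisy) test labels
pred_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.85, 0.15],  # given label is 1, but the model is confident it's class 0
    [0.7, 0.3],
    [0.1, 0.9],
])

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # most suspicious first
)
print(issue_indices)  # indices of test examples whose labels look wrong
```

Flagged examples can then be reviewed and corrected, yielding the cleaner test set used for the final prompt comparison.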

Let me know what you think!

submitted by /u/cmauck10