New study quantifies degradation in GPT-4 for the first time

Since March I've collected a half-dozen threads of user complaints, from Twitter and this subreddit, about the degraded quality of GPT outputs, and I've noticed a big drop in quality myself. A common (and reasonable) response was that the perceived drop was the result of perception anchoring, desensitization, or something else unrelated to the model's actual performance.

A new study by researchers Chen, Zaharia, and Zou at Stanford and UC Berkeley now shows that these perceived degradations are quantifiable and significant between the March 2023 and June 2023 versions of the models. They find:

  • "For GPT-4, the percentage of [code] generations that are directly executable dropped from 52.0% in March to 10.0% in June. The drop was also large for GPT-3.5 (from 22.0% to 2.0%)." (!!!)
  • For sensitive questions: "An example query and responses of GPT-4 and GPT-3.5 at different dates. In March, GPT-4 and GPT-3.5 were verbose and gave detailed explanation for why it did not answer the query. In June, they simply said sorry."
  • "GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) in this task."

I think these findings underline that (a) the decline in quality was not just a matter of perception, and (b) we need a way to track model performance over time. Building a business on these APIs without monitoring and controlling for performance drift is high-risk; a minimal sketch of what that tracking could look like is below.
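One way to do that tracking is to re-run a fixed, versioned prompt set against pinned model snapshots on a schedule and log the scores, so drift shows up as a trend rather than a vibe. The sketch below assumes the openai Python package as it existed in mid-2023 (the `ChatCompletion.create` API) with an `OPENAI_API_KEY` in the environment; the tiny eval set, the scoring rule, the log filename, and the exact snapshot names ("gpt-4-0314" / "gpt-4-0613", roughly matching the study's March/June comparison) are all placeholder assumptions.

```python
# Minimal sketch of tracking model drift: score pinned snapshots against a
# fixed eval set and append the results to a log for charting over time.
# Assumes the pre-1.0 openai package (ChatCompletion API) and OPENAI_API_KEY.
import datetime
import json

import openai

# A tiny fixed eval set: (prompt, expected substring). In practice this would
# be a larger, versioned benchmark relevant to your application.
EVAL_SET = [
    ("Is 17077 a prime number? Answer yes or no.", "yes"),
    ("What is 12 * 13? Answer with the number only.", "156"),
]


def score_model(model: str) -> float:
    """Return the fraction of eval prompts the given model snapshot gets right."""
    correct = 0
    for prompt, expected in EVAL_SET:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = resp["choices"][0]["message"]["content"].lower()
        correct += expected in answer
    return correct / len(EVAL_SET)


if __name__ == "__main__":
    # Pinned snapshots roughly matching the study's March/June comparison.
    for snapshot in ["gpt-4-0314", "gpt-4-0613"]:
        result = {
            "date": datetime.date.today().isoformat(),
            "model": snapshot,
            "accuracy": score_model(snapshot),
        }
        # Append each run to a JSONL log so drift can be charted over time.
        with open("model_drift_log.jsonl", "a") as f:
            f.write(json.dumps(result) + "\n")
```

Run on a cron schedule, a log like this gives you a concrete time series per snapshot instead of relying on user perception.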

You can read a summary of the study here.

You can also find a link to the arXiv paper here and a link to the GitHub repo here.

submitted by /u/Successful-Western27