Why forecasting AI performance is tricky: the following 4 trends fit the observed data equally as well

I was trying to replicate a forecast found on AI 2007 and thought it'd be worth pointing out that any number of trends could fit what we've observed so far with performance gains in AI, and at this juncture we can't use goodness of fit to differentiate between them. Here's a breakdown of what you're seeing:

The blue line roughly coincides with AI 2027's "benchmark-and-gaps" approach to forecasting when we'll have a super coder. 1.5 is the line where a model would supposedly beat 95% of humans on the same task (although it's a bit of a stretch given that they're using the max score obtained on multiple runs by the same model, not a mean or median).
Green and orange are the same type of logistic curve where different carrying capacities are chosen. As you can see, assumptions made about where the upper limit of scores on the RE-Bench impact the shape of the curve significantly.
The red curve is a specific type of generalized logistic function that isn't constrained to symmetric upper and lower asymptotes.
I threw in purple to illustrate the "all models are wrong, some are useful" adage. It doesn't fit the observed data any worse than the other approaches, but a sine wave is obviously not a correct model of technological growth.
There isn't enough data for data-driven forecasting like ARIMA or a state-space model to be useful here.

Long story short in the absence of data, these forecasts are highly dependent on modeling choices - they really ought to be viewed as hypotheses that will be tested by future data more than an insight into what that data is likely to look like.

submitted by /u/Murky-Motor9856
[link] [comments]