Currently, generated videos look a lot like this.
Until the models grasp physics, causality, can model natural human behavior, etc., they will keep looking a lot like this. That understanding is a necessary condition for modeling what happens next in a given video.
Put another way: IF AND WHEN we have Turing-test-level video generation, we also have a reality predictor. To the model there is no difference between predicting what happens next from a video frame and predicting what happens next in a car crash, both from visual input. Now, if inference time gets low enough (and the trend is that inference time and cost drop very quickly), we can predict frames faster than they happen, i.e. real-time prediction into the future.
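To make "faster than they happen" concrete, here's a minimal back-of-the-envelope sketch. The frame rate and per-frame inference time are purely illustrative assumptions, not measurements of any real model:

```python
# Rough sketch: when does frame prediction run ahead of reality?
# All numbers are hypothetical, chosen only to illustrate the threshold.

FRAME_RATE_HZ = 30                      # assumed rate of "reality" / playback
frame_interval_s = 1.0 / FRAME_RATE_HZ  # ~33 ms of real time per frame

inference_time_s = 0.020                # hypothetical per-frame generation cost

# If each predicted frame costs less wall-clock time than the real-time
# interval it covers, the model is predicting into the future in real time.
speedup = frame_interval_s / inference_time_s
if speedup > 1:
    print(f"Predicting {speedup:.1f}x faster than real time")
else:
    print("Still slower than real time")
```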
People often mention quantum uncertainty to me. However, at macroscopic scales this uncertainty is far too small to override classical physics: we can still predict where rockets and planets will be in the future, quantum uncertainty and all. A car crash still follows classical physics, i.e. motion, energy, etc.
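To illustrate what "classical physics is predictable" means here, a minimal sketch that forward-integrates Newtonian motion for a projectile (all initial values and the timestep are arbitrary, illustrative assumptions). The point is only that the same initial state always produces the same future trajectory:

```python
import math

# Deterministic forward prediction under classical mechanics:
# simple Euler integration of a projectile under gravity.

g = 9.81          # gravitational acceleration, m/s^2
dt = 0.001        # integration timestep, s
x, y = 0.0, 0.0   # initial position, m
vx = 40.0 * math.cos(math.radians(45))  # initial velocity components, m/s
vy = 40.0 * math.sin(math.radians(45))

t = 0.0
while y >= 0.0:
    # Identical initial conditions always yield the identical future state:
    # the "Laplace's demon" property at this scale.
    x += vx * dt
    y += vy * dt
    vy -= g * dt
    t += dt

print(f"Impact after {t:.2f} s at x = {x:.1f} m")
```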
I say all of this because I think at this point most people will agree that we will sooner or later have Turing-level video generation. But not everyone would agree that we will have something that could accurately predict, say, the details of a future car crash. I think they are the same thing. Essentially, a video generator IS a Laplace's demon, and if you don't have a Laplace's demon you don't have Turing-level video generation. What's the difference between predicting the next video frame and predicting the next instant of reality?