I don't know if this is how Google Gemini's thought process works, but here is my architectural idea of how this could work.
Something like a Jetson Orin or Nano would be a perfect vehicle to test this out.
Effectively, you would take the computer vision side of the Jetson device, process each still frame, and place the results into a table, then run an LLM/model analysis on each frame's output description. You would have to prompt the model into some defined structure.
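To make that frame-to-table step concrete, here is a minimal sketch, assuming OpenCV for capture and SQLite for the table. describe_frame() is a hypothetical stand-in for whatever vision model actually runs on the Jetson, not a real API; the rest is standard library plus cv2.

    # Minimal sketch of the per-frame description step.
    # describe_frame() is a hypothetical stand-in for whatever on-device
    # vision/VLM model runs on the Jetson; everything else is OpenCV + sqlite3.
    import json
    import sqlite3
    import time

    import cv2


    def describe_frame(frame) -> dict:
        """Hypothetical: run the on-device vision model and return a structured
        description, e.g. {"objects": [{"label": "dog", "motion": "walking left"}]}."""
        raise NotImplementedError("plug in your Jetson vision pipeline here")


    def capture_descriptions(db_path="frames.db", camera_index=0, fps=5):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS frames (ts REAL, description TEXT)")
        cap = cv2.VideoCapture(camera_index)
        try:
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                desc = describe_frame(frame)          # structured output per frame
                conn.execute(
                    "INSERT INTO frames VALUES (?, ?)",
                    (time.time(), json.dumps(desc)),
                )
                conn.commit()
                time.sleep(1.0 / fps)                 # throttle to a steady frame rate
        finally:
            cap.release()
            conn.close()

The point of the table is just to keep a rolling, timestamped history of structured descriptions that the predictive step can read from.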
Then an AI model would do a predictive motion analysis of what the next frame's motion is expected to be; in other words, the essence of what motion is.
This, effectively, would be the Vision of Thoughts (VoT) engine.
The forward-predictive nature of the analysis would provide a streaming output of what is being "seen". In real time it would produce a running description of the scene: I see a dog walking. I see a car moving.
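Here is a minimal sketch of that predictive loop, continuing from the table above. llm_complete() is a hypothetical stand-in for whichever LLM you call (local or hosted), and the prompt structure is illustrative only.

    # Minimal sketch of the predictive "VoT" loop over the frames table.
    # llm_complete() is a hypothetical stand-in for your LLM client.
    import json
    import sqlite3
    import time


    def llm_complete(prompt: str) -> str:
        """Hypothetical: send the prompt to your LLM and return its text reply."""
        raise NotImplementedError("plug in your LLM client here")


    def predictive_stream(db_path="frames.db", window=5, interval=0.5):
        conn = sqlite3.connect(db_path)
        while True:
            rows = conn.execute(
                "SELECT ts, description FROM frames ORDER BY ts DESC LIMIT ?",
                (window,),
            ).fetchall()
            if len(rows) >= 2:
                # Oldest first, so the LLM sees the motion in chronological order.
                history = [json.loads(desc) for _, desc in reversed(rows)]
                prompt = (
                    "Here are structured descriptions of the last "
                    f"{len(history)} frames:\n{json.dumps(history, indent=2)}\n"
                    "In one sentence, describe what is moving and where it will "
                    "most likely be in the next frame."
                )
                print(llm_complete(prompt))   # the streaming "I see a dog walking" output
            time.sleep(interval)

The key design choice is that prediction runs on the structured descriptions, not the raw pixels, so the output is already in a communicable format.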
Think about the way lidar and self-driving cars work today: the object is always information in a reactive sense, tied to that moment in time. Is there a system today that does predictive analysis from live video streams combined with LLM reasoning? I don't think so, but I could be wrong. Again, I am not talking about rote prediction, but prediction grounded in information that makes sense. Moreover, if you could predict slightly ahead and express that motion analysis in a communicable format, it could serve many purposes. Self-driving cars and robotics come to mind; there could be many other applications.
Humans track this way as well; we call it anticipation. Vision with anticipation is something we greatly need.
To summarize,
Computer vision alongside LLM analysis and predictive motion reasoning, producing a real-time stream of descriptive outputs.