So a video popped up from Wes Roth that I started watching. By the way, I really like the way Wes goes through his explanations, because they're clear and concise. Unlike me ;-P. While watching it I was like, hmmm, that paper has diagrams that look pretty familiar. OK, they're planning the World View Foundational Model. Here's what I posted some time ago for reference. That W* is exactly an Interactive Agent Foundation Model; that's what that means. Now, look at this. YES! I love it. I should have added empathy; how can you not have empathy? Agent observations are the Worldview Situational Stimuli. It's THIS. I would love to work on the memory portion of this.

OK, let's go into a little bit of exactly what Microsoft is saying here. But before we even go there, look at the Stream of Thoughts concept. People are freaking out about the outward projections of video that we get to see, but remember that SORA is seeing this within. In a way, it's streaming a coherent system of actual thoughts about a world system. Microsoft says Agent-Observation and Perception. That IS literally Situational Stimuli. This isn't me speculating; they are saying THINKING, EMPATHY, SENSE <<<, CONSCIOUSNESS. If they are building this like this, I am with Lex at this point. Who are we to say it's not conscious? OK, let's get back to what Microsoft is saying about memory here, because that is a major issue that needs a proper solution.
So unfortunately, they don't really go into much detail about memory and persistence per se. My model is all about creating a method by which you can localize and create dynamic memory that interacts with these foundation models. In Section 4.2 they talk about a pre-training strategy where they take interactions with video and conversation/actions, annotate those, and train the model on them. In Section 5, Tasks, they get into what the model is actually evaluated on.
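To make the memory idea a bit more concrete, here is a minimal sketch of the kind of localized dynamic memory I mean: a small local store that persists past interactions and recalls the most relevant ones to prepend as context before querying a (stateless) foundation model. Everything here (the MemoryStore class, the stdlib similarity ranking) is my own hypothetical illustration, not anything from the paper.

```python
# A minimal sketch of localized dynamic memory wrapping a foundation model.
# All names here (MemoryStore, remember, recall, build_prompt) are hypothetical.
from difflib import SequenceMatcher


class MemoryStore:
    def __init__(self):
        self.entries: list[str] = []  # persisted interactions / observations

    def remember(self, text: str) -> None:
        self.entries.append(text)

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Rank stored entries by rough textual similarity to the query.
        # A real system would use embeddings; stdlib similarity keeps
        # this sketch self-contained and runnable.
        ranked = sorted(
            self.entries,
            key=lambda e: SequenceMatcher(None, query, e).ratio(),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(memory: MemoryStore, user_query: str) -> str:
    # Inject recalled context so the stateless foundation model behaves
    # as if it had persistent memory of this user / world.
    context = "\n".join(memory.recall(user_query))
    return f"Relevant memories:\n{context}\n\nUser: {user_query}"


memory = MemoryStore()
memory.remember("User prefers short answers.")
memory.remember("User is building a Minecraft agent.")
print(build_prompt(memory, "What should my agent do next in Minecraft?"))
```

The point is that the memory lives locally, and only the relevant slice of it ever reaches the big model.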
So what they're saying is: when you make a model multimodal, in GENERAL it performs well across the board. Sam literally mentioned this in his recent talks. They actually test this against GPT-4V.
I wrote about this here in Singularity, and what I experimented with was trying to get the LLM to be the thing that can predict next actions; it didn't go well. I posted about Vision of Thoughts (VoT) here 2 months ago. Microsoft calls this Visual Language Models <<< This is HUGE! https://www.reddit.com/r/artificial/comments/18fa7x6/vision_of_thoughts_vot_a_light_proposal_for/ I tried to get GPT-4 to understand multiple images in a sequence from the perspective of physics and movement so that it could predict the next action in the scene. However, GPT-4 was not good at gaining that coherent nuance, so I abandoned the idea. I gave it a good fight too, with an overly detailed prompt and math and the whole nine yards, but it just wasn't able to have that human-level understanding and "anticipation" of what to expect next, of "things in motion" like a video.

Going back to Section 7, Ablations and Analysis, of Microsoft's paper: it is clear that they too found that path of using only GPT-4V computer vision infeasible. Instead they use the games Minecraft and Bleeding Edge to get finer-grained control with text instruction, leading to a better dataset of predicted actions and ground-truth actions. In Section 6.4, Healthcare Experiments, they use a healthcare dataset and evaluate the model's ability on three separate downstream tasks: video captioning, visual question answering, and activity recognition <<<< PREDICTION/ANTICIPATION in the form of RASS score prediction.
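For reference, my VoT experiment looked roughly like the sketch below: send GPT-4V an ordered sequence of frames and ask it to reason about the motion and predict the next action. The frame paths are made up, and I'm assuming the gpt-4-vision-preview endpoint that was current at the time; the message format follows the OpenAI vision API.

```python
# A rough reconstruction of the VoT next-action experiment described above.
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_frame(path: str) -> str:
    # Base64-encode a local image so it can be sent inline to the API.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


frames = ["frame_01.jpg", "frame_02.jpg", "frame_03.jpg"]  # hypothetical paths

content = [{
    "type": "text",
    "text": ("These frames are consecutive moments of one scene. "
             "Reason about the physics and motion across them, then "
             "predict the most likely next action in the scene."),
}]
for path in frames:
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(path)}"},
    })

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # GPT-4V endpoint, an assumption on my part
    messages=[{"role": "user", "content": content}],
    max_tokens=300,
)
print(response.choices[0].message.content)
```

In my experience, no amount of prompt detail got this to produce a coherent "things in motion" prediction, which matches what Microsoft found in their ablations.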
So back to Section 7, where they conclude. Again, it can't be emphasized enough: an agent trained on multiple stimuli, including stimuli from video and the real world, can produce a better overall Agent AI. They do say that this does NOT improve text-generation abilities, and that's OK; who would've thought it would anyway? However, action recognition is important/amazing in its own right. Think of it as a specific language for video analysis that the agent understands. As long as that form of communication can make it back to the querier/prompter in the form of language, that's all that's needed, and that would be easy for a few-shot mechanism, or outright training, to handle (see the sketch after the quote below). I wish they would have spoken more about that particular part. Their impact statement is, lol, Chef's Kiss! I am just going to leave it at that. THANK YOU MICROSOFT. I GOT IT. This paper is A++++++. To bring it all home on why I am so excited about AGI being a real, obtainable thing: VIDEO is the KEY here, and MEMORY. Starting with video: being able to understand the visual coherence of what you see is just a leap in true cognitive ability. Microsoft says it too. It's not just me being hyperbolic; Microsoft is saying it themselves:
"Agent AI is emerging as a promising avenue toward AGI."
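Coming back to the action-language point from a moment ago, here's a hedged sketch of what I mean by a few-shot mechanism translating an agent's action-recognition output back into plain language for the prompter. The action tokens and examples are entirely invented for illustration; nothing like this appears in the paper.

```python
# A hypothetical few-shot prompt that teaches an LLM to translate
# action-recognition tokens back into natural language. The token
# vocabulary below is invented for illustration.
FEW_SHOT_EXAMPLES = """\
Action tokens: <pick_up> <object:mug> <move_to:table>
Description: The agent picks up the mug and carries it to the table.

Action tokens: <open> <object:door> <walk_through>
Description: The agent opens the door and walks through it.
"""


def action_translation_prompt(tokens: str) -> str:
    # In practice this prompt would be sent to an LLM; here we just
    # assemble the few-shot context that conditions the translation.
    return f"{FEW_SHOT_EXAMPLES}\nAction tokens: {tokens}\nDescription:"


print(action_translation_prompt("<grasp> <object:syringe> <place:tray>"))
```

That's the whole trick: the agent's internal "video language" only needs a thin translation layer to get back to human language.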
What Microsoft is saying with that line, and with the paper as a whole, is: don't use LLMs to just CV your way into recognizing objects and actions. I wish they would have touched on two additional topics, however.
I believe the key to all of this will be how we can use local edge devices to train nano-models for memory that can speak to and communicate with these larger models about things like context, preferences, and in general understanding the Worldview Stimuli of new situations and experiences. True AGI will not happen without a truly coherent memory function. What's scary is that OpenAI releasing SORA takes everything in this paper to a whole new level of jaw-dropping excitement, because it may be that a very powerful model that is showing us video right now is completely capable of coherently understanding the world around it. Think about that. :|
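To sketch what I mean by an edge nano-model for memory: a tiny on-device module that accumulates a user's context and preferences locally, and only ever ships a distilled summary up to the big foundation model. All the names here (EdgeMemory, observe, distill) are hypothetical, my own illustration of the idea.

```python
# A hedged sketch of an on-device "nano-model" memory module.
# Names and structure are hypothetical, not from the paper.
from dataclasses import dataclass, field


@dataclass
class EdgeMemory:
    preferences: dict[str, str] = field(default_factory=dict)
    recent_stimuli: list[str] = field(default_factory=list)

    def observe(self, stimulus: str) -> None:
        # New situations and experiences land here first, on-device.
        self.recent_stimuli.append(stimulus)
        self.recent_stimuli = self.recent_stimuli[-50:]  # bounded window

    def distill(self) -> str:
        # Only this compact summary ever leaves the device; the large
        # remote model sees the distilled worldview, not the raw stream.
        prefs = "; ".join(f"{k}={v}" for k, v in self.preferences.items())
        recent = " | ".join(self.recent_stimuli[-3:])
        return f"Preferences: {prefs}. Recent context: {recent}"


edge = EdgeMemory(preferences={"tone": "concise"})
edge.observe("User asked about RASS scores twice today.")
print(edge.distill())  # this string gets prepended to the cloud model's prompt
```

Whether the nano-model is a literal trained network or just structured storage plus a small summarizer, the persistence story is the same: coherent memory stays with the user.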