After SORA I am Starting To Feel the AGI – Revisiting that Agent Paper: Agent AI is emerging as a promising avenue toward AGI – W* Visual Language Models

So a video popped up from Wes Roth that I started watching. By the way, I really like the way Wes goes through his explanations because they're clear and concise. Unlike me ;-P.

While watching it I was like hmmm. That paper has diagrams that look pretty familiar.

OK. They're planning the World View Foundational Model.

Here's what I posted some time ago for reference. That W* is exactly an Interactive Agent Foundation Model. That's what that means.

https://preview.redd.it/oxru0uf496jc1.jpg?width=6477&format=pjpg&auto=webp&s=f7072dae4e23cb2d42170eccc95b6f49e4ee5b58

Now, look at this. YES! I love it. I should have added empathy; how can you not have empathy?

https://preview.redd.it/cl6jxa9896jc1.jpg?width=1066&format=pjpg&auto=webp&s=85a6807786f804a32aa0fe39693251688fa90f4a

Agent observation is the Worldview Situational Stimuli. It's THIS.

https://preview.redd.it/6hgw84r996jc1.jpg?width=6456&format=pjpg&auto=webp&s=8a0b43ece56b79786a076ca200e46b083ac74e61

I would love to work on the memory portion of this. OK, let's go into a little bit of exactly what Microsoft is saying here. But before we even go there, look at the Stream of Thoughts concept. People are freaking out about the outward projections of video that we get to see, but remember that SORA is seeing this within. In a way, it's streaming a coherent system of actual thoughts about a world system.

Microsoft says Agent-Observation and Perception. That IS literally situational Stimuli. This isn't me speculating; they are saying THINKING, EMPATHY, SENSE <<<, CONSCIOUSNESS.

If they are building it like this, I am with Lex at this point. Who are we to say it's not conscious?

Ok, let's go back to what Microsoft is saying about memory here because that is a major issue that needs a proper solution.

  1. Perception that is multi-sensory with fine granularity. Like humans, multi-sensory perception is crucial for agents to understand their environment, such as gaming environments, to accomplish various tasks. In particular, visual perception is useful for agents that can parse the visual world (e.g., images, videos, gameplay).
  2. Planning for navigation and manipulation. Planning is important for long-range tasks, such as navigating in a robotics environment and conducting sophisticated tasks. Meanwhile, planning should be grounded on good perception and interaction abilities to ensure plans can be realized in an environment.
  3. Interaction with humans and environments. Many tasks require multiple rounds of interactions between AI and humans or the environment. Enabling fluent interactions between them would improve the effectiveness and efficiency of completing tasks for AI.

So unfortunately they don't really go into much detail about Memory and persistence per se. My model is all about creating a method in which you can localize and create dynamic memory to interact with said foundational models.

In section 4.2 they describe their Pre-Training Strategy, where they take interactions made up of video, conversation, and actions, annotate them, and train the model on that data.
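To make that concrete, here's a minimal sketch of what a joint video / text / action pre-training step could look like. This is my own toy illustration, not the paper's actual architecture: the encoder, the frame size, the action space, and the loss weighting are all assumptions.

```python
# Toy sketch of joint video/text/action pre-training (not the paper's code).
# Assumes batches of (frames, text tokens, per-frame ground-truth actions).
import torch
import torch.nn as nn

class AgentPretrainModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_actions=16, frame_dim=3 * 64 * 64):
        super().__init__()
        # Toy visual encoder: each 3x64x64 frame is flattened into a d_model embedding.
        self.visual_encoder = nn.Linear(frame_dim, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)     # next-token prediction
        self.action_head = nn.Linear(d_model, n_actions)  # per-frame action prediction

    def forward(self, frames, text_tokens):
        # frames: (B, T, 3, 64, 64); text_tokens: (B, L)
        B, T = frames.shape[:2]
        vis = self.visual_encoder(frames.reshape(B, T, -1))
        txt = self.text_embed(text_tokens)
        seq = torch.cat([vis, txt], dim=1)
        # Causal mask so each position only attends to earlier frames/tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        hidden = self.backbone(seq, mask=mask)
        action_logits = self.action_head(hidden[:, :T])   # one action per frame
        token_logits = self.lm_head(hidden[:, T:])        # language modeling over the text span
        return action_logits, token_logits

def pretrain_step(model, optimizer, frames, text_tokens, actions,
                  lm_weight=1.0, action_weight=1.0):
    action_logits, token_logits = model(frames, text_tokens)
    # Next-token loss: predict token t+1 from position t.
    lm_loss = nn.functional.cross_entropy(
        token_logits[:, :-1].reshape(-1, token_logits.size(-1)),
        text_tokens[:, 1:].reshape(-1))
    # Action loss: match the ground-truth action taken at each frame.
    action_loss = nn.functional.cross_entropy(
        action_logits.reshape(-1, action_logits.size(-1)), actions.reshape(-1))
    loss = lm_weight * lm_loss + action_weight * action_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is just the shape of the objective: one backbone, fed both frames and text, optimized for language modeling and action prediction at the same time.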

In section 5, Tasks, they say:

We believe that a foundational model, trained in visual, language and agent capabilities, leads to a powerful and general-purpose tool that significantly impacts a variety of interactive tasks.

To evaluate the effectiveness of our approach, we applied the model to three major agent-AI scenarios, encompassing representative downstream tasks: 1) Robotics: human-machine manipulation in the physical world; 2) Gaming: human-machine embodiment in virtual reality; 3) Healthcare: augmented human-machine interaction in traditional multimodal tasks. For these tasks, the pre-trained model was fine-tuned with specific datasets. As a result, the model demonstrated reasonable and competitive performance in terms of action prediction, visual understanding, natural language-driven human-machine interactions, gaming, and hospital scene understanding. We outline the task definitions and specific datasets used below.

So what they're saying is: when you make a model multimodal, in GENERAL it performs well across the board. Sam literally mentioned this in his recent talks.

They actually test this against GPT-4V.

  1. Ablations and Analysis: Comparisons with GPT-4V: In Figure 10, we show how our model has the ability to output low-level action predictions, while GPT-4V is unable to consistently output low-level controls. While our model is able to output precise movements and actions, GPT-4V only outputs high-level instruction.

https://preview.redd.it/8uti0m7e96jc1.jpg?width=1066&format=pjpg&auto=webp&s=bfa73789024446c8d28e4669f611be07b87a503b
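Just for intuition, here's a toy illustration of the distinction (my own made-up action space, not the paper's): "low-level" means per-timestep control values, while GPT-4V tends to answer with a high-level instruction string.

```python
from dataclasses import dataclass

# Hypothetical action representations to illustrate low-level vs. high-level output.
# The paper's actual gaming/robotics action spaces differ; this is just for intuition.

@dataclass
class LowLevelAction:
    left_stick: tuple[float, float]   # movement direction, each axis in [-1, 1]
    right_stick: tuple[float, float]  # camera direction, each axis in [-1, 1]
    buttons: dict[str, bool]          # e.g. {"jump": True, "attack": False}

# What a low-level policy emits: one concrete action per frame.
trajectory = [
    LowLevelAction((0.0, 1.0), (0.1, 0.0), {"jump": False, "attack": False}),
    LowLevelAction((0.0, 1.0), (0.2, 0.0), {"jump": True, "attack": False}),
]

# What a high-level instruction from GPT-4V looks like instead:
high_level_instruction = "Move toward the objective and jump over the gap."
```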

I wrote about this here in Singularity, and what I experimented with was trying to get the LLM to be the thing that can predict next actions; it didn't go well.

I posted about Vision of Thoughts (VOT) here 2 months ago. Microsoft calls this Visual Language Models <<< This is HUGE!

https://www.reddit.com/r/artificial/comments/18fa7x6/vision_of_thoughts_vot_a_light_proposal_for/

I tried to get GPT-4 to understand multiple images in a sequence from the perspective of physics and movement so that it could predict the next action in the scene. However, GPT-4 was not good at gaining that coherent nuance, so I abandoned the idea. I gave it a good fight too, with an overly detailed prompt, math, and the whole 9 yards, but it just wasn't able to have that human-level understanding and "anticipation" of what to expect next, of "things in motion" like a video.

https://preview.redd.it/57bvm0jf96jc1.jpg?width=2026&format=pjpg&auto=webp&s=4b76b7860070d0719f2e7c3ac2f34ca2036f084e

https://preview.redd.it/lk0pj76g96jc1.jpg?width=688&format=pjpg&auto=webp&s=0add79e3b20305d77dff0052d5164299344c6cd2

https://preview.redd.it/7e251ukg96jc1.jpg?width=690&format=pjpg&auto=webp&s=286520a8cdb07c0b6688f71b72e5e1b12eb743a5
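For reference, here's roughly the kind of loop I was attempting, as a minimal sketch: send an ordered sequence of frames to GPT-4V and ask it to predict what happens next. The model name and frame paths are placeholders.

```python
# Rough sketch of prompting GPT-4V with an ordered frame sequence to predict the next action.
# Assumes the OpenAI Python client; model name and image files are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_frame(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

frames = ["frame_01.jpg", "frame_02.jpg", "frame_03.jpg"]  # placeholder frame paths

content = [{"type": "text", "text": (
    "These images are consecutive frames from a video, in order. "
    "Reason about the motion and physics across the frames, then predict "
    "the most likely next action or event in the scene.")}]
content += [encode_frame(p) for p in frames]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder model name
    messages=[{"role": "user", "content": content}],
    max_tokens=300,
)
print(response.choices[0].message.content)
```

Even with much more elaborate prompting than this, the answers stayed at the level of scene description rather than anticipation.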

Going back to section 7 of Microsoft's paper, Ablations and Analysis, it is clear that they ran into the same thing: relying only on GPT-4V for computer vision is not a feasible path.

Instead, they use gameplay from Minecraft and Bleeding Edge to get finer-grained control via text instructions, which yields a better dataset of predicted actions paired with ground-truth actions.

https://preview.redd.it/60t9w2sh96jc1.jpg?width=1086&format=pjpg&auto=webp&s=b42879cd30facd54ea3f0ff0c8f3b30e24fa48e9

In section 6.4, Healthcare Experiments, they use a healthcare dataset and evaluate the model's ability on three separate downstream tasks: video captioning, visual question answering, and activity recognition <<<< PREDICTION/ANTICIPATION in the form of RASS score prediction.

So back to section 7, where they conclude:

Effects of Agent Pre-Training: In Table 2 and Table 4, we demonstrate the effectiveness of our agent pre-training strategy compared to training from scratch and training against an equivalent visual-language baseline. In particular, we show that a commonly used approach for fine-tuning visual-language models by using frozen visual encoders, similar to LLaVA (Liu et al., 2023) or Mini-GPT-4 (Zhu et al., 2023), performs worse than joint fine-tuning for action recognition on our healthcare dataset. Furthermore, our agent pre-training boosts performance for action prediction across all gaming and robotics datasets.
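As a quick illustration of the distinction they're drawing, here's a sketch of the two fine-tuning setups. The module names and learning rates are my own placeholders, not the paper's or LLaVA's code: the LLaVA / MiniGPT-4 style freezes the visual encoder and trains only the projection and language side, while joint fine-tuning unfreezes the visual encoder as well.

```python
import torch
import torch.nn as nn

# Toy stand-in for a multimodal model; module names are illustrative only.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual_encoder = nn.Linear(768, 512)    # stand-in for a ViT
        self.projection = nn.Linear(512, 512)        # visual-to-language projection
        self.language_model = nn.Linear(512, 32000)  # stand-in for an LLM head

model = ToyVLM()

# LLaVA / MiniGPT-4 style: freeze the visual encoder, train only the rest.
for p in model.visual_encoder.parameters():
    p.requires_grad = False
optimizer_frozen = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)

# Joint fine-tuning (what the paper reports works better for healthcare
# action recognition): update the visual encoder too, typically at a lower LR.
for p in model.visual_encoder.parameters():
    p.requires_grad = True
optimizer_joint = torch.optim.AdamW([
    {"params": model.visual_encoder.parameters(), "lr": 1e-5},
    {"params": model.projection.parameters(), "lr": 1e-4},
    {"params": model.language_model.parameters(), "lr": 1e-4},
])
```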

Again, it can't be emphasized enough: an agent trained with multiple stimuli, including video and real-world stimuli, can produce a better overall Agent AI. They do say that this does NOT improve text generation abilities, and that's OK; who would've thought it would anyway?

However, action recognition is important/amazing in its own right. Think of it as a specific language for video analysis that the agent understands. As long as that form of communication can make it back to the query/prompter in the form of language, that's all that's needed. That should be doable with a few-shot mechanism, or just outright training the model to recognize that communication. I wish they had spoken more about that particular part.

Their impact statement is, lol, Chef's Kiss! I am just going to leave it at that. THANK YOU MICROSOFT. I GOT IT.

This Paper is A++++++

To bring it all home on why I am so excited about AGI being a real, obtainable thing: VIDEO is the KEY here, along with MEMORY. Starting with video, being able to understand the visual coherence of what you see is just a leap in true cognitive ability.

Microsoft says it too. It's not just me being hyperbolic; Microsoft is saying it themselves.

Figure 1. Overview of an Agent AI system that can perceive and act in different domains and applications. Agent AI is emerging as a promising avenue toward Artificial General Intelligence (AGI). Our model represents an initial step in the development of a model that is highly capable of human-level reasoning across many tasks and levels of granularity.

Agent AI is emerging as a promising avenue toward AGI.

the AI community has a new set of tools for developing generalist, action-taking AI systems en route to artificial general intelligence. Despite their impressive results across various AI benchmarks, large foundation models frequently hallucinate the presence of objects and actions in scenes and infer factually incorrect information (Rawte et al., 2023; Peng et al., 2023). We posit that one of the key reasons why these foundation models hallucinate is due to their lack of grounding in the environments in which they are trained (e.g., large-scale internet data instead of physical or virtual environments). Furthermore, the dominant approach for building multimodal systems is to leverage frozen pre-trained foundation models for each modality and to train smaller layers that allow for cross-modal information passing

What they're saying is: don't use LLMs to just CV your way into recognizing objects and actions; grounding in the environment is what this paper is all about.

I wish they would have touched on 2 additional topics however.

  1. How do you loop this back into the multimodal system so that this form of communication can be used with a foundational LLM?
  2. Memory

I believe the key to all of this will be how we can use local edge devices to train nano-models for memory that can speak to and communicate with these other models for things like context, preferences, and, in general, understanding the Worldview Stimuli of new situations and experiences. True AGI will not happen without a truly coherent memory function.
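To make that slightly less hand-wavy, here's a minimal sketch of what I mean by a local memory component: a small on-device store that embeds experiences and retrieves the most relevant ones to hand back to a foundation model as context. The embed() function is a placeholder for whatever small local model would actually run on the device.

```python
# Minimal sketch of a local, dynamic memory store for an agent (my idea, not the paper's).
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder pseudo-embedding so the sketch runs; a real system would use
    # a small on-device embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

class LocalMemory:
    def __init__(self):
        self.entries: list[tuple[str, np.ndarray]] = []

    def remember(self, experience: str) -> None:
        """Store an observation, preference, or piece of context."""
        self.entries.append((experience, embed(experience)))

    def recall(self, situation: str, k: int = 3) -> list[str]:
        """Return the k stored memories most similar to the current situation."""
        if not self.entries:
            return []
        q = embed(situation)
        scored = sorted(self.entries, key=lambda e: float(q @ e[1]), reverse=True)
        return [text for text, _ in scored[:k]]

memory = LocalMemory()
memory.remember("User prefers concise answers.")
memory.remember("The robot arm jammed when moving past 45 degrees.")
# Retrieved memories would be prepended to the foundation model's prompt as context.
context = memory.recall("Plan the next robot arm movement.")
```

The design choice I care about is that the store lives locally and only the retrieved snippets travel to the big model, which is what would let memory stay personal and dynamic.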

What's scary is that OpenAI releasing SORA takes everything in this paper to a whole new level of jaw-dropping excitement, because it may be that a very powerful model that is showing us video right now is completely capable of coherently understanding the world around it.

Think about that. :|
