Google's recent unveiling of Gemini is sure to take the internet by storm!
It felt as though all hope was lost after the recent announcement of a delay until January 2024; however, on December 6th, Google blessed us with quite the news… They posted multiple videos and blog posts showing off the new AI model's capabilities, and I will share key information from those materials here. After viewing most of them, I'm quite impressed with Gemini and would like to explain why!
Firstly, and most notably, Google compared Gemini against GPT-4 and GPT-4V and claims it outperforms them on 30 out of 32 benchmarks, including multi-discipline reasoning (the MMLU benchmark), image and document understanding, code generation, and more. While benchmarks have their limits, this early success in measurable comparisons is still impressive; even if the margin is only a few percent, it positions the model to finally dethrone OpenAI, which is exactly what Google wants.
Gemini's multimodal architecture allowed Google to train on several modalities concurrently, departing from prior approaches that stitch separate models together. The model is engineered to seamlessly comprehend and use multiple data types at once: text, code, audio, image, and video. This concept has been explored before in research papers such as "Any-to-Any Generation via Composable Diffusion" (arXiv:2305.11846 [cs.CV]), but watching the demos, the unified design appears to offer distinct advantages over dividing and stitching perception. By gaining a more holistic perspective, the model seemingly achieves heightened reasoning abilities; to see for yourself, I recommend the video "Gemini: Google's newest and most capable AI model".
Google also released a blog post detailing ways we could use multimodal prompting with Gemini; I highly recommend checking it out on the Google Developers Blog to grasp the vast range of new possibilities this model opens up. In the post they reveal Gemini's deft comprehension across modalities and the dynamic reasoning abilities that arise from combined text and image input. Simple prompts evaluate its graphical descriptions, while ingenious puzzles test complex reasoning, spatial awareness, logic, and knowledge. Prototyping games in real time is another really cool demonstration of how such tools could transform many creative processes.
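To make the idea of multimodal prompting concrete, here is a toy Python sketch of what an interleaved text-and-image prompt looks like conceptually: a single ordered sequence of mixed "parts" that a natively multimodal model consumes all at once. The `Part` class and `build_prompt` helper are hypothetical illustrations of mine, not Google's API, and the image paths are made up:

```python
from dataclasses import dataclass

@dataclass
class Part:
    kind: str   # "text" or "image" (a real API would support audio/video too)
    data: str   # raw text, or a path/URI standing in for image bytes

def build_prompt(*parts: Part) -> list[dict]:
    """Flatten mixed parts into one ordered sequence the model sees jointly."""
    return [{"kind": p.kind, "data": p.data} for p in parts]

# An interleaved prompt in the spirit of the blog post's puzzle examples:
prompt = build_prompt(
    Part("text", "Which of these two objects is heavier?"),
    Part("image", "left_object.jpg"),
    Part("image", "right_object.jpg"),
    Part("text", "Explain your reasoning step by step."),
)
```

The key point the sketch captures is ordering: because the model ingests text and images in one sequence rather than through stitched-together sub-models, instructions can refer back and forth to specific images within the same prompt.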
Not to mention, the short inference time of the text-to-speech used in "Hands-on with Gemini: Interacting with multimodal AI" makes it feel far more immersive; I do hope they let us customize the voice, though.
Finally, let's talk about AlphaCode 2, which is bound to make some programmers really reconsider their life choices after its public release. It is a massive step up from its predecessor AlphaCode, released almost exactly a year earlier on December 8th, 2022, which performed at roughly the level of the median human competitor on competitive programming tasks. AlphaCode 2 is reported to outperform an estimated 85% of competition participants on problems spanning Python, Java, C++, and Go, far surpassing its predecessor; Google also reports a 75% first-try solve rate on a set of 200 Python functions, climbing to about 90% after the model reviews its own errors.
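The AlphaCode line of systems is described as sampling many candidate programs and filtering them against the problem's example tests before submitting. Here is a toy sketch of that generate-filter idea, not AlphaCode 2's actual pipeline; the candidate solutions and tests below are stand-ins I made up for illustration:

```python
# Stand-ins for model-sampled candidate programs for "double the input":
candidates = [
    "def solve(x): return x + 1",   # wrong in general
    "def solve(x): return x * 2",   # correct
    "def solve(x): return x ** 2",  # correct only for some inputs
]

# Public example tests as (input, expected output) pairs:
example_tests = [(1, 2), (3, 6)]

def passes(src: str) -> bool:
    """Run one candidate against every example test."""
    namespace = {}
    exec(src, namespace)            # compile the candidate into a namespace
    solve = namespace["solve"]
    try:
        return all(solve(x) == y for x, y in example_tests)
    except Exception:
        return False

# Keep only candidates that survive all example tests:
survivors = [c for c in candidates if passes(c)]
print(survivors)  # only the "x * 2" candidate passes both tests
```

The real systems operate at a vastly larger scale (sampling huge numbers of candidates and clustering the survivors), but this filtering step is why generating many diverse attempts helps so much on competitive programming problems.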
We will have to wait and see how it fares against GitHub Copilot and the other coding assistants available today, but one can imagine that the level of comprehension Gemini displays, as a byproduct of its fully multimodal architecture, will be a massive contributor to its effectiveness. More details about the code model can be found in the video "Gemini: Excelling at competitive programming".
Overall, on paper the model seems like a very big step towards AGI, especially if it were embodied in robotic avatars such as Aura, the robot stationed at the Sphere in Las Vegas. We will have to wait until it is made publicly available to know for sure. What a time to be alive…
Do you think this is overhyped and GPT-4 will hold its ground against this competitor, or not? I can't wait to hear your thoughts on the matter!
Follow me for more relevant discussions and news about AI! ;)