OK, I wanted to give this a fair go, and my first impressions are not good. I am not impressed.
I ran an A/B evaluation with GPT-4 fielding the questions on one side and Bard's new Gemini on the other.
A little TL;DR up front: Bard seemed to constantly diverge into a different line of Q&A because it was so far off track, and that was a real surprise.
Also, I will not provide specific results, because Google has stated in a disclaimer that they are monitoring everything that goes through. It's not my job to help them, quite frankly.
What I found in comparison makes it very telling that they didn't release Ultra up front. I can also clearly see why they held Ultra back: there's no possible way it would be much better, and it would have received very bad reviews.
Last thing before we get started. Google, through marketing efforts, started releasing all of these analytics and metrics showing why Ultra performs better at certain tasks. Great, but A) they didn't release that model to the public, and B) when speaking about AGI, I think the public's observation will be more critical than some public STEM-style tests. This goes for any model. Why? Well, just like kids in school, you can teach to the test and get good results, but it doesn't mean anything if everything else you do is not great.
The test comparison, for reference, is related to software engineering and programming (finding and fixing bugs in a complex system).
Let's start. Warning: this is from the perspective of an SME power user who is concerned with enterprise implications.
---------------------------------------------- Review of Bard's Gemini Pro ------------------------------------------
- It hallucinates badly (D+): It is akin to GPT-2+ rather than GPT-3+, let alone GPT-3.5 or 4. The hallucinations suggest it struggles mightily with any real reasoning capability. The reasoning you experience even in GPT-3.5 is leaps and bounds more accurate than where Bard is right now. Where GPT would take a context in 2 or 3 layers and give an accurate and coherent response, Bard just gives up, responds with factually incorrect answers, and states them as fact.
- If reasoning is the prime strength of GPT-4, Bard seemingly doesn't have the capability to reason across layers of scope to arrive at the correct response. Think Chain of Thought, or better yet, Chain of Reasoning (CoR): I can hold these concepts in my mind, thinking about each one, until I eventually come to a conclusive answer about the entire scope of thought.
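To make the "layers of scope" idea concrete, here's a minimal sketch of the kind of layered prompt I mean. The bug scenario and wording are hypothetical, made up for illustration, not from my actual test set:

```python
# Hypothetical layered-reasoning prompt: each "layer" is a fact the model
# must hold in scope simultaneously to reach the single correct conclusion.
layers = [
    "Context: a payment service retries failed webhook deliveries.",
    "Layer 1: the retry queue occasionally duplicates events.",
    "Layer 2: duplicates only appear after a worker restart.",
    "Layer 3: the dedup cache is in-process memory, not shared.",
]

prompt = (
    "Reason through each fact below step by step, holding all of them "
    "in scope, then name the single root cause of the duplicate events.\n\n"
    + "\n".join(layers)
)

print(prompt)
```

GPT-4 can chain these layers together (the in-process cache is lost on restart, so dedup fails); in my experience Bard loses the thread after the first layer or two.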
- The citations are ridiculously bad (D): Not only is it giving incorrect information, it's giving sources and citations that literally don't contain any information about what was actually queried in the first place. So if one thought the training was from that source, that's not true. And, which is hilarious, Google Search works kind of like this, which makes me wonder if they're trying to bolt on the same technology here. It's really concerning if that's the case.
- How much is Google Search embedded in and assisting Bard's Gemini? To me, this is not a good path forward if that's the case. It may have gotten Gemini to an early release, but the end result leaves much to be desired.
- The source information is so wrong that I would warn Google to seriously rethink this strategy. Either they're admitting their training data is wildly off the mark, or there is such a dissociation between what they tell us is the source and what the model is actually parroting out that the sources are useless and NOT A PROOF OF WORK.
- I asked Bard a simple question about what the latest version of something was, and it tripped up all over the place (this is the only clue I am giving). Everything about it was wrong: the source, the suggested links, and the version.
------------------------- Google Search Analysis In relation to Bard/Gemini -----------------------------
I have to break out of the review for a moment because I want to address the Google Search issue. Google Search has been met with industry complaints (think advertisers) that it creates an experience where you never leave the Google page. Now, this isn't right or wrong; it's just how it works. When you run a query, Google does this thing where it tries to highlight an answer for you in text, with bold words, to give the appearance of "I have your answer right here." It's kind of like a proof of search, if you will. Sometimes it's great and other times it's way off the mark.
In an odd way, Gemini Pro and its citations (and information) have almost the same effect. It's as if they're using that engine to match against your prompt and then come up with a response that is often off the mark.
It's almost like a different kind of hallucination where the source information is way off the mark so the response is way off the mark. That's my impression of it.
Then, when Bard suggests links, those seem to be a straight shot in the dark. The information is often totally unrelated. It's really bad. A manual Google search is 10x more useful than the links Bard is suggesting. They're not even the literal top Google search results. From this I know that Bard is not really analyzing those results; they are just bootstrapping a version of Google Search to bring back seemingly random links that are matched more on titles than on useful knowledge.
To be fair, this is not something GPT-4 does well either, but GPT-4 comes back right away and says, "yeah, I didn't find anything useful from what I searched related to your question." It admits right away that it can't find the information being asked for.
LOL, can we teach these AGIs how to search? It's a useful skill that is tricky (as we're realizing).
In summary, the way Bard handles search and surfaces useful information is not good at all. The fact that this seems like a core engine for them is a dangerous game; it looks like an obsolescent crutch that could come back to bite them if this is the road they're going down.
I hope to god Ultra is not going to work in this way because the results will not be good.
------------------------- End search analysis: resume review ---------------------------------------
------------------------- Resume Review --------------------------------------------------------------------
- Response Style (STOP TALKING) (F): To be fair, GPT-4 struggles with this mightily (but eerily seems to be getting better). This is where a knowledgeable SME asks something and the chatbot starts vomiting out a bunch of information. Oh, I absolutely hate this. I am asking for something specific. Either you know or you don't know. Providing every G**D*** detail over and over again drives me literally nuts. I am asking for specific information and I want a pointed response. This illusionary smoothing through "more content" is an industry-wide struggle right now. It's like there is a telemetric threshold of "I am not too sure about this answer, so start injecting CoT and just break everything down so that perhaps I can reason to the right answer." I don't want to experience that all of the time. If I ask you for a proof of work, or to give me your reasoning, then that's different. If I am asking you a pointed question, I don't need a dissertation. The proverbial "less is more," if you will. Both GPT and Bard get F's for this.
- Presentation of Response and Coherence (A): What can you say? The responses (stylistically) are good. LLaMA, Claude, and GPT have all achieved this capability. The grammar is good, the writing style is very good. It's just wrapping incorrect information, but it looks nice; so, there's that.
- Usefulness (D+): I can't just keep doling out F's here, but I can't take this seriously as a main driver because it doesn't achieve the same results as GPT-4. In my chain of questioning (or shots), it just starts outputting responses so poor, so off and wrong, that I don't trust it. This is where GPT-4 really shines. The information it responds with is of such quality that it is very reliable. When it doesn't know, or gets something wrong, the way it handles it is much better and easier to notice.
The hallucinations are creeping their way out of GPT, while the pain of hallucinations is right up front and center with Gemini's Bard.
- Being an SME in the field I'm prompting about lets me notice faster when something is on the ridiculous side. It's that feeling of "what are you talking about? That can't be possible" when asking something and seeing the response Bard gives out.
- STEM Teaching to the Test (F): When I teach my son, much to his mother's chagrin, I spend extra time with him going over concepts and foundational understanding. When he gets an A in math, I am part of the reason. Why do I know this? Because when he comes to me and doesn't understand, it's my job to figure out which parts of the foundation he doesn't understand so we can focus on those parts. If you can't foundationally understand something, you will have a rippling chain effect of not being able to do anything about that subject matter or any extension of it. This is the proverbial "throw the entire thing away." Google should be very careful with this, and so should any aspiring AGI world builder, including GPT. Think of it this way: will the world's understanding of how AGI works today be starkly different 25-50 years from now? This is the quintessential question. If you are going down the wrong path, it could set you back for years or decades. When teaching to a STEM test to get bragging-rights results, be careful you are not just shooting your shot for quick paper reviews that are more marketing than substance. Rather than teaching to the test, make damn sure this can work in a general sense. Make sure the foundation is sound. Do not train or "teach" to the test.
- If Google is just showing us Ultra results but there is a `Wizard of Oz` effect here, they will be punished when they finally do release Ultra; the public will not be kind. This could set them back for years, and that may already factually be the case. "Where is Gemini Ultra" is going to be the increasing refrain because of just how incapable this is in today's form.
- Missing Parts, "Where is Gemini Ultra" (D): I've seen Google do this before. Remember the demo where they had an AI call a hair salon and everyone thought that was the bee's knees? Remember how that doesn't even exist today? Too many times Google has demoed something and it has not panned out. The risk here is monumental. They showed us something on one hand with score metrics and demos, but they oh-so-slickly held out on releasing any of that, giving us the lesser model for now. If Sam Altman famously said "Where is Gemini," I think now the wording can be "Where is Gemini Ultra." With all of the above analysis, I am very skeptical of the efficacy of Gemini Ultra. Will it be on par with GPT-4 or not? This relates to the above point/analysis. If these infractions make their way into Ultra, it will be an epic dud. Obviously, this is why Google released Gemini Pro first: to get the feedback, data, and analysis they need to even bring Ultra to fruition. However, I'd advise caution. This goes back to the foundational roots. If you're doing something badly now, what do you expect when you amplify that effect with a larger model? GPT met that challenge going from 3.5 to 4. Will Gemini do the same? I am skeptical; this is an opinion, but from what I am seeing with all of the points I made above, I am not sure.
- Vision looks cool, but where is it? GPT-4 has vision now for my enterprise needs.
- Data analysis: GPT-4 has this now.
- Text-to-Speech/Speech-to-Text: Google has to get an A here because of YouTube. They can't possibly lose this race, but where is it? Azure has fine applications in this space that are top tier, so...
- Enterprise Usefulness and Usage (D): Keep in mind I am speaking about Gemini Pro and not Ultra, because I can't review that yet. Here's the thing: I would in no way choose Bard over any of the models I am using now. In AI model/application building, there are different tiers of models you think about when building. You have custom-trained models for some things that are cheaper and more pointed, so they're efficient. Or you need to bring out the Lamborghini (GPT-4) for the final layer of reasoning and thought to make your final result (magical). As of today, I just don't see where Gemini fits into this. It's not open source and it's not great. There is a lot to be desired in the space Gemini is filling. As of now, it doesn't have a place for me, and that's the issue. Where does this fit in? As of today, nowhere.
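The tiering I'm describing looks roughly like this in practice. A minimal sketch, assuming made-up model names and a stand-in `call_model` function (not any real API):

```python
# Hypothetical tiered routing: cheap, pointed models for routine steps,
# the expensive "Lamborghini" model only for the final reasoning layer.
CHEAP_MODEL = "custom-finetuned-small"   # assumed name, illustration only
FLAGSHIP_MODEL = "gpt-4"                 # assumed name, illustration only

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real model API call.
    return f"[{model}] response to: {prompt}"

def answer(task: str, needs_deep_reasoning: bool) -> str:
    # Route routine extraction/classification to the cheap tier;
    # reserve the flagship for the final reasoning pass.
    model = FLAGSHIP_MODEL if needs_deep_reasoning else CHEAP_MODEL
    return call_model(model, task)
```

The point is that every model in the stack has to earn its slot on cost, quality, or openness, and right now Gemini Pro doesn't win any of those three.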
In summary, for me, Gemini in comparison to GPT-4 (and even 3.5) is not getting good marks. There is a chance they deliver on Ultra, but until then... where is Ultra? I am not entertained/impressed. Google has a track record of underwhelming on official release. In a way, they released this and it is OK for 90% of people, but for the power user (engineers, SMEs, architects, scientists) who is expecting an AGI look and feel: this ain't it. What's more concerning is that there seem to be some foundational things that will not scale well unless they vastly improve. Let's see.
And I want to be fair: for the occasional user, the non-enterprise, non-automation world-builder user, this may seem cute and cuddly and well presented. And that's OK; it's something to build on. The low grades here do not mean in any way that they can't come out swinging on Ultra and impress the hell out of me then.
For now, it's just going to have to be: where is Gemini Ultra?
Final Grades:
Power User: D+
Casual User: B+