Apple will unleash iOS with (GPT-4o) this week at WWDC 2024 – I still have questions about the 4o api docs and the demo – Is there a "Secret API" only Apple will have

There is something from the demo that still gnaws at me. Mira said that the 4o model reasons across voice (audio), text, and vision (video).

I still don't see any indication of this anywhere in the API for developer usage and consumption.

First, I am asking: is this a model consolidation from an API perspective for creators, or is this something available internally only to ChatGPT-4o itself?

I will use audio and video as examples. Text already comes with an iterative streaming feature, so that is the kind of feature set I am looking for, something that correlates with the demo and its output capabilities.
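
For reference, this is the streaming feature I mean on the text side. It follows the standard chat completions usage pattern; only the prompt is my own made-up example:

from openai import OpenAI

client = OpenAI()

# Stream the text response back token-by-token as it is generated
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How many elements are in the periodic table?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")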

Audio

Audio falls under Speech-to-Text (STT) and Text-to-Speech (TTS). For this concern we are talking about the 'whisper model' modality in the API docs, and more specifically STT, because that is the input side.

I'm not seeing anything coming from 4o in this regard. STT is still performed by a separate model, Whisper.

from openai import OpenAI

client = OpenAI()

audio_file = open("/path/to/file/audio.mp3", "rb")
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)
print(transcription.text)

Would the expectation be that eventually it will no longer be the Whisper model handled separately, and audio would go through 4o instead?

But on the merits, would it make any difference if this were only a 1-to-1 change of model name, i.e. whisper-1 to gpt-4o? I would think that if we are really talking about something "omni" here, the audio would give back other characteristics beyond STT.
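
To make the point, a pure name swap would look like this. This is hypothetical, gpt-4o is not a documented model for the transcriptions endpoint, and the sketch only shows how little a rename alone would buy us:

# Hypothetical: identical call shape, only the model name changed
transcription = client.audio.transcriptions.create(
    model="gpt-4o",  # assumption: not a documented option for this endpoint
    file=audio_file
)
print(transcription.text)  # still just STT, nothing "omni" about it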

For example, is the person saying something angrily, or are they excited while speaking? Is the person anxious and in need of immediate medical or emergency attention? Tonal characteristics could be important metadata about the incoming audio.

Moreover, "omni" would suggest that the incoming audio file wouldn't just get an STT pass; wouldn't the model also come back with a full response altogether?

So, you give me audio and I return an entire response without making an additional call. Isn't this what Mira was referring to when she said it can reason over all formats with one model, and that this really reduces latency?

If I recorded myself saying, "Hi, I am wondering how many elements are in the periodic table of known elements," and then sent the audio file (or stream) to GPT-4o, I would expect a response that wasn't just [[Audio -> STT] -> TTS] but rather [Audio -> Audio], all in one shot.

In the middle of [Audio -> Audio] I would imagine a payload accompanying the returning audio consisting of the following (a rough sketch of such a call follows this list):

  • STT
  • TTS
  • Tonality metadata
  • other metadata
  • Audio File
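
Put differently, here is a purely hypothetical sketch of what an [Audio -> Audio] call could look like. None of these endpoints or response fields exist in the current API docs; the names are all my own invention:

# Hypothetical sketch only -- no such endpoint or response shape is documented
from openai import OpenAI

client = OpenAI()

with open("/path/to/file/question.mp3", "rb") as audio_file:
    response = client.audio.omni.create(   # hypothetical endpoint
        model="gpt-4o",
        file=audio_file,
    )

print(response.stt_text)   # hypothetical: transcription of my spoken question
print(response.tts_text)   # hypothetical: text of the spoken reply
print(response.tonality)   # hypothetical: anger/excitement/urgency metadata
with open("reply.mp3", "wb") as f:
    f.write(response.audio)  # hypothetical: the returning audio itself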

Vision

Vision is a little different, but still similar to audio. Vision is more complicated because video exists as many bundled individual image frames in a time series. Or, more simply, you could have a single image.
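
As of today, the closest thing the API offers is sampling frames yourself and sending them as images alongside a text question through chat completions. A minimal sketch, assuming opencv-python for the frame sampling, with a made-up clip path and prompt:

import base64
import cv2  # assumption: opencv-python installed for frame sampling
from openai import OpenAI

client = OpenAI()

# Sample frames out of the clip ourselves -- there is no video endpoint
video = cv2.VideoCapture("/path/to/file/clip.mp4")
frames = []
while True:
    ok, frame = video.read()
    if not ok:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    frames.append(base64.b64encode(buffer.tobytes()).decode("utf-8"))
video.release()

# Send a subset of frames plus a text question through chat completions
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is on the table in these frames?"},
            *[
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in frames[::30]  # roughly one frame per second at 30 fps
            ],
        ],
    }],
)
print(response.choices[0].message.content)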

Vision also has another important complication: it doesn't come with an inherent question built into the visual point of interest. Not everything in your field of vision is worth talking about, which is very different from words coming out of somebody's mouth. So the audio/text components that accompany the visuals are important co-collaborators, just as they are in human conversations about something visual.

Because of this, the vision components need to be accompanied by text (or audio) components. You can then go to the [Audio -> Audio] style of output. It would look something like this: [Vision + Audio -> Audio].

In this way the vision is there, and the audio is added on top of it, asking about something available in an image or in a series of images over a period of time.

If you remember, in one of the demos it was particularly difficult for the model to "line up" the visual media with the prompter's query. If I remember correctly, there was a moment where GPT responded that it saw a brown table, which was something from a few seconds earlier rather than the user's current frame. Again, not a knock on the demo, just an immensely difficult set of engineering tasks going on all at once.

In the middle of [Vision + Audio -> Audio] I would imagine a payload accompanying the returning audio consisting of the following (again, a rough sketch follows this list):

  • STT
  • TTS
  • Vision Transcription -> What was the analysis of the images/video used in the process
  • Vision metadata -> this would line up prompting STT with visual components for analysis. i.e., this grouping of images came across this text prompt... something of that nature
  • Tonality metadata
  • other metadata
  • Audio File
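
Pulling that together, a purely hypothetical sketch of what a [Vision + Audio -> Audio] call might look like. None of these endpoints, parameters, or response fields exist in the current API docs; the names are my own wish list:

# Hypothetical sketch only -- no such endpoint or response shape is documented
from openai import OpenAI

client = OpenAI()

with open("/path/to/file/clip.mp4", "rb") as video_file, \
     open("/path/to/file/question.mp3", "rb") as audio_file:
    response = client.omni.create(          # hypothetical endpoint
        model="gpt-4o",
        video=video_file,                   # hypothetical parameter
        audio=audio_file,                   # hypothetical parameter
    )

print(response.stt_text)           # hypothetical: transcription of my spoken question
print(response.tts_text)           # hypothetical: text of the spoken reply
print(response.vision_transcript)  # hypothetical: what the model saw in the frames
print(response.vision_metadata)    # hypothetical: which frames lined up with which prompt
print(response.tonality)           # hypothetical: anger/excitement/urgency signals
with open("reply.mp3", "wb") as f:
    f.write(response.audio)        # hypothetical: the returning audio itself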

Now, I am asking for these things as an end user of the OpenAI APIs for development purposes. To Mira's point, I was excited because I thought the APIs would represent this new world of development and capability.

I imagine this with GPT-4o

[Audio -> Audio]

[Vision + Audio -> Audio]

As of now, we don't seem to be getting anything like this. Everything is effectively still separate. I can build all of the things I am speaking about on my own, but that just makes 4o a smaller, lighter, cheaper model compared to 4. There's really no "o" in it, at least from a developer's perspective.
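
For what it's worth, "building it on my own" today means chaining three separate calls, which is exactly the [[Audio -> STT] -> TTS] path from earlier. A minimal sketch (the file paths and voice choice are just placeholders):

from openai import OpenAI

client = OpenAI()

# 1. STT: separate Whisper call
with open("/path/to/file/question.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Reasoning: separate text-only chat call
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcription.text}],
)
answer = completion.choices[0].message.content

# 3. TTS: separate speech call
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer,
)
with open("reply.mp3", "wb") as f:
    f.write(speech.content)

Three round trips, three lots of latency, which is exactly the thing Mira said the single omni model was supposed to remove.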

So how does Apple maybe fit into all of this? I have a strong suspicion that WWDC is going to show off more capable features, like the ones I am supposing here, miraculously baked into the iOS SDK.

If this is the case, and only Apple and Microsoft effectively get those tools (I am reaching here; I don't know exactly how WWDC will present these capabilities to devs, or whether there is a surprise OpenAI announcement coming), that would be really disappointing for developers. I really don't know. BUT, if I see that the iOS SDK is way more capable along the lines of my wish list above, that is going to IRK the hell out of me.

The implication would be that you can build in an "omni" way for iOS but not as an individual developer. In reality, ChatGPT-4o may be an update with a "secret" API that is omni, but I am not seeing that flow down to the end-user developer. Either it is a secret API that has not been released, or it isn't "omni" by any means.
