Reverse engineering GPT-4o image gen via Network tab – here’s what I found

I am very intrigued by this new model; I have been working in the image generation space a lot, and I want to understand what's going on.

I found some interesting details when opening the Network tab to see what the backend (BE) was sending - here's what I found. I tried a few different prompts; let's take this one as a starter:

"An image of happy dog running on the street, studio ghibli style"

Here I got four intermediate images, as follows:

https://preview.redd.it/af6usgcurdre1.png?width=2048&format=png&auto=webp&s=1dd11d3982699203f3f7a22dfa11cedeab794145

We can see:

  • The BE is actually returning the image as we see it in the UI
  • It's not really clear whether the generation is autoregressive or not - we see some details and a faint global structure of the image, which could mean two things:
    • 1. Like a typical diffusion process, the global structure is generated first and details are added later
    • 2. The image is actually generated autoregressively

At first I leaned toward the 2nd option, but thinking about it I am not so sure. If we look at a 100% zoom of the first and last frames, we can see details being added to high-frequency textures like the trees:

https://preview.redd.it/73iaslc8tdre1.png?width=2608&format=png&auto=webp&s=e242c23e3199a5a1e153c27f3c0f13002751fdc0
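If you want to quantify this instead of eyeballing it, here is a minimal sketch of the check: compare the high-frequency energy of the first vs. the last intermediate frame saved from the Network tab. The file names are placeholders, and the Laplacian-variance metric is just one cheap proxy for "amount of fine detail", not anything GPT-4o specific.

    import numpy as np
    from PIL import Image
    from scipy.ndimage import laplace

    def high_freq_energy(path):
        # Variance of the Laplacian ~ how much fine/high-frequency detail the image has
        gray = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
        return float(np.var(laplace(gray)))

    # Placeholder names for the first and last intermediate frames from the Network tab
    first = high_freq_energy("frame_1.png")
    last = high_freq_energy("frame_4.png")
    print(f"high-freq energy: first={first:.5f} last={last:.5f} ratio={last / first:.2f}")

A ratio clearly above 1 means the later frame carries much more fine texture, which is what you'd expect from a coarse-to-fine (diffusion-like) process rather than a purely left-to-right/top-to-bottom autoregressive reveal.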

This is what we would typically expect from a diffusion model. The effect is even more pronounced in this other example, where I prompted specifically for a high-frequency, detailed texture ("create the image of a grainy texture, abstract shape, very extremely highly detailed"):

https://preview.redd.it/5lkitdrntdre1.png?width=2048&format=png&auto=webp&s=771fc7845991b6f84870276f516d4991f62271fd

Interestingly, I got only three images from the BE here, and the addition of detail is obvious:

https://preview.redd.it/aof4zufwtdre1.png?width=2058&format=png&auto=webp&s=0a0d8e714a094466640d0dde3f3fba9093622b41
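Another quick check you can run on the saved frames (again, placeholder file names): diff consecutive intermediate images to see where the changes land. For a diffusion-style refinement you'd expect changes spread across the whole frame and concentrated on high-frequency regions, rather than a clean top-to-bottom sweep.

    import numpy as np
    from PIL import Image

    def frame_diff(path_a, path_b, out_path):
        # Absolute per-pixel difference between two consecutive intermediate frames
        a = np.asarray(Image.open(path_a).convert("L"), dtype=np.float32)
        b = np.asarray(Image.open(path_b).convert("L"), dtype=np.float32)
        d = np.abs(b - a)
        d_vis = (255 * d / max(d.max(), 1e-6)).astype(np.uint8)  # normalize for viewing
        Image.fromarray(d_vis).save(out_path)
        return d.mean()

    print(frame_diff("texture_frame_1.png", "texture_frame_2.png", "diff_12.png"))
    print(frame_diff("texture_frame_2.png", "texture_frame_3.png", "diff_23.png"))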

Of course, this could also be done as a separate post-processing step - for example, SDXL introduced a refiner model that was specifically trained to add details to the VAE latent representation before decoding it to pixel space.
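For reference, this is roughly what that two-stage "add details at the end" handoff looks like with SDXL in diffusers - a minimal sketch of the documented base + refiner pattern as I remember it, purely to illustrate the kind of setup I mean, not anything GPT-4o related:

    import torch
    from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

    base = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2,  # share encoder/VAE with the base to save memory
        vae=base.vae,
        torch_dtype=torch.float16,
    ).to("cuda")

    prompt = "An image of happy dog running on the street, studio ghibli style"

    # Base model handles the first ~80% of the denoising schedule and hands over latents
    latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images
    # Refiner runs the last ~20%, specialized in adding high-frequency detail, then decodes
    image = refiner(prompt=prompt, denoising_start=0.8, image=latents).images[0]
    image.save("dog_refined.png")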

It's also unclear whether I got fewer images with this prompt due to availability (i.e. how many FLOPs the BE could give me), or due to some kind of specific optimization (e.g. latent caching).

So here's where I am at now:

  • I am inclined to think that the generation is still primarily based on diffusion
  • There might be some refiner model in place as a post-processing step
  • I think the real difference comes from how they connected the input and output spaces of images and text; it makes me think of this recent paper: OmniGen

There they directly connect the VAE of a latent diffusion architecture to an LLM and learn to jointly model both text and images; they also observe few-shot capabilities and emergent properties, which would explain the vast capabilities of GPT-4o, and it makes even more sense if we consider the usual OAI formula:

  • More / higher quality data
  • More flops

The architecture proposed in OmniGen has great potential to scale, given that it is purely transformer-based - and if we know one thing for sure, it's that transformers scale well, and that OAI is especially good at that.
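To make that idea a bit more concrete, here is a toy PyTorch sketch of what "attend over text tokens and VAE image latents in one transformer and predict noise on the image part" could look like. This is purely my illustration of the OmniGen-style setup with made-up dimensions - not the paper's actual code, and certainly not OAI's architecture.

    import torch
    import torch.nn as nn

    class JointTextImageTransformer(nn.Module):
        def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=6, latent_dim=4):
            super().__init__()
            self.text_emb = nn.Embedding(vocab_size, d_model)
            # Project VAE latent "tokens" (e.g. 4-channel latent pixels) into the model dimension
            self.latent_in = nn.Linear(latent_dim, d_model)
            self.time_emb = nn.Linear(1, d_model)  # diffusion timestep conditioning
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)
            self.latent_out = nn.Linear(d_model, latent_dim)  # noise prediction per latent token

        def forward(self, text_ids, noisy_latents, t):
            # text_ids: (B, T_text), noisy_latents: (B, T_img, latent_dim), t: (B, 1)
            txt = self.text_emb(text_ids)
            img = self.latent_in(noisy_latents) + self.time_emb(t).unsqueeze(1)
            h = self.backbone(torch.cat([txt, img], dim=1))  # joint attention over both modalities
            return self.latent_out(h[:, text_ids.shape[1]:])  # predict noise only for image tokens

    model = JointTextImageTransformer()
    eps_hat = model(torch.randint(0, 32000, (2, 16)),  # a short text prompt
                    torch.randn(2, 64, 4),             # 64 latent tokens, e.g. an 8x8 latent grid
                    torch.rand(2, 1))                  # diffusion timestep
    print(eps_hat.shape)  # torch.Size([2, 64, 4])

The point of the sketch is just that the whole thing is one transformer over a mixed token sequence, which is exactly the kind of design that scales with more data and more FLOPs.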

What do you think? I would love to use this as a space to investigate together! Thanks for reading, and let's get to the bottom of this!

submitted by /u/seicaratteri