GPT4 is 8 x 220B params = 1.7T params

For a while we’ve been hearing rumors that GPT-4 is a trillion-parameter model. Well, in the last week some insiders have shed light on this.

It appears the model is actually a Mixture of Experts (MoE), where each of the eight experts has 220B params, totaling roughly 1.76T parameters. Interestingly, MoE models have been around for some time.

So what is a MoE?

Most likely, the same dataset was used to train all eight experts. Even though no human explicitly assigned different topics to them, each expert may have developed its own areas of proficiency.

This is a bit of a simplification, since the way the experts actually specialize is still poorly understood. It’s likely there’s a lot of overlap in expertise.

The final output isn't simply the best output from one of the eight experts; rather, it's a combination of the outputs from all of them. This blending is typically managed by another, generally smaller, neural network that determines how to weight and combine the experts' outputs.

This typically happens on a per-token basis. For each individual word, or token, the network uses a gating mechanism that determines the degree to which each expert's output contributes to the final prediction.

These outputs are then blended together, a token is chosen based on the combined output, and the network moves on to the next token.
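
Here's a minimal numpy sketch of what that per-token routing might look like. The dimensions, the top-2 routing, and the random linear-layer "experts" are toy assumptions for illustration only, not details of GPT-4's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration -- nothing here reflects GPT-4's real dimensions.
d_model, n_experts, top_k = 16, 8, 2

# Stand-in "experts": random linear maps instead of 220B-param networks.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

# Small gating network: a single linear layer that scores every expert for a token.
gate_w = rng.standard_normal((d_model, n_experts))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token_vec):
    """Route one token: score the experts, run only the top-k, blend their outputs."""
    scores = token_vec @ gate_w
    chosen = np.argsort(scores)[-top_k:]          # indices of the best-scoring experts
    weights = softmax(scores[chosen])             # renormalize over the chosen experts
    outputs = np.stack([token_vec @ experts[i] for i in chosen])
    return weights @ outputs                      # weighted blend of expert outputs

print(moe_forward(rng.standard_normal(d_model)).shape)  # (16,)
```

Running only the top-scoring experts for each token is what keeps compute and memory traffic close to a single expert's cost, which matters for the bandwidth discussion below.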

Why the 220B limit?

The H100, a roughly $40,000 high-performance GPU, offers a memory bandwidth of 3,350 GB/s. While adding more GPUs increases the total memory, it doesn't necessarily improve the bandwidth (the rate at which data can be read from or written to memory). This implies that if you load a 175-billion-parameter model in 8-bit (about 175 GB of weights, all of which must be streamed for every token), you can theoretically process around 19 tokens per second.
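
A quick back-of-the-envelope check of that 19 tokens/second figure, assuming every 8-bit weight is streamed from memory once per token and ignoring activations and KV-cache traffic:

```python
# Rough bandwidth-bound estimate -- assumes all weights are read once per token.
bandwidth_gb_s = 3350          # H100 memory bandwidth
model_gb = 175 * 1             # 175B params at 1 byte each (8-bit) ≈ 175 GB

tokens_per_s = bandwidth_gb_s / model_gb
print(round(tokens_per_s, 1))  # ≈ 19.1 tokens/second
```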

In a MoE, the model handles one expert at a time. As a result, a sparse model with 8x220 billion parameters (1.76 trillion in total) would operate at a speed only marginally slower than a dense model with 220 billion parameters. This is because, despite the larger size, the MoE model only invokes a fraction of the total parameters for each individual token, thus overcoming the limitation imposed by memory bandwidth to some extent.
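
Extending the same rough math, under the post's assumption that only one 220B expert (plus a small gating network) is touched per token:

```python
# Same bandwidth-bound estimate, comparing sparse routing with a dense 1.76T model.
bandwidth_gb_s = 3350
active_gb = 220 * 1            # one 220B expert at 8-bit per token
dense_gb = 8 * 220 * 1         # all 1.76T params, if the model were dense

print(round(bandwidth_gb_s / active_gb, 1))  # ≈ 15.2 tokens/s for the sparse MoE
print(round(bandwidth_gb_s / dense_gb, 1))   # ≈ 1.9 tokens/s if every param were read per token
```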

If you enjoyed this, follow me on Twitter for more AI explainers - https://twitter.com/ksw4sp4v94 - or check out what we’ve been building at threesigma.ai.


submitted by /u/serjester4