I try to understand GPT-3/4 conceptually. Not enough coding knowledge yet to understand it from code.
Simple question:
I know that GPT outputs one token (a distribution over tokens) at a time and is then fed the result, which gives the next token, and so on.
But is every iteration a "blank slate" of the model, or is it able to keep information stored between token generations?
Example:
I1) Input sequence: "My cat" -> Next token: "is"
I2) Input sequence: "My cat is" -> Next token: "furry"
-> Is GPT in the same initial state when it receives "My cat is" as it was when it got "My cat"?
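To make the question concrete, here is a minimal sketch of the generation loop as I understand it (using Hugging Face's GPT-2 as a stand-in, since GPT-3/4 aren't runnable locally, and greedy decoding for simplicity). Each iteration is a fresh forward pass over the whole sequence, with no hidden state carried over from the previous iteration:

```python
# Minimal autoregressive loop, sketched with GPT-2 as a stand-in for GPT-3/4.
# Each model(...) call is a fresh forward pass over the FULL sequence;
# nothing survives between iterations except the growing token ids.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("My cat", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(2):  # e.g. generate "is", then "furry"
        logits = model(ids).logits           # forward pass over the whole sequence
        next_id = logits[0, -1].argmax()     # greedy pick from the last position
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```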
Also, apart from the residual stream, what parts of GPT are able to store information?
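One related mechanism I've read about: at inference time, implementations typically keep a KV cache, which stores the attention keys/values of earlier tokens so they aren't recomputed at every step. Here's a sketch of the same loop with caching (again assuming Hugging Face's GPT-2 and its `past_key_values` API as a stand-in). As far as I understand, this is pure memoization, so it produces the same tokens as the stateless loop above:

```python
# Same loop, but reusing cached attention keys/values between steps.
# The cache is an inference-time speedup (memoization), not extra memory
# that changes what the model outputs.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("My cat", return_tensors="pt").input_ids
past = None        # no cache yet
next_ids = ids     # first pass sees the whole prompt

with torch.no_grad():
    for _ in range(2):
        out = model(next_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values           # cached K/V for all tokens so far
        next_id = out.logits[0, -1].argmax()
        next_ids = next_id.view(1, 1)        # only the new token is fed next time
        ids = torch.cat([ids, next_ids], dim=1)

print(tokenizer.decode(ids[0]))
```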