I'm trying to understand how GPTs/LLMs work, on a conceptual level and using the correct terminology.
Here's my understanding so far (please correct if I'm wrong):
- GPTs are pre-trained so that, for any given input, they produce the statistically most likely output based on their training.
- They do this token by token, without "understanding" the output, just having learned that this token is often followed by that other token (I've sketched my mental model of this after the list).
- It gains this knowledge during training, when the LLM is fed a large amount of text as embeddings (i.e. its "knowledge").
- An LLM can be fine-tuned after the pre-training stage, building on what it already learned to become more accurate in a particular domain. This happens by feeding it domain-specific labelled data and updating the model's parameters until it reaches the desired accuracy on the new data.
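
To check I've got the token-by-token part right, here's roughly what I picture happening under the hood. A minimal sketch assuming the Hugging Face transformers library and the small "gpt2" checkpoint; I gather real chat models sample from the predicted distribution rather than always taking the argmax:

```python
# Minimal sketch of token-by-token generation with greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokens = tokenizer("Once upon a time", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(tokens).logits      # a score for every vocabulary token
        next_id = logits[0, -1].argmax()   # greedy: take the single most likely one
        tokens = torch.cat([tokens, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(tokens[0]))
```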
Here's the bit I don't understand about "memory".
AFAIK, LLMs do not have long-term memory in the human sense (if I tell you today that I have a 6-year-old son, a year from now you'd know little Billy is 7).
So how are these models able to answer related follow-up questions in the chat?
e.g.
"tell me a story"
<some story>
"make it shorter"
<shortens the story>
- Is the application just passing the previous Q&A back in via the context window on every turn? (I've sketched my guess after this list.)
- Will the context window and the number of tokens required just keep growing the longer the conversation goes on? (Second sketch below.)
- Are there architectures where the model queries some database ("select * from user_history") before answering? Is that what vector databases are used for? (Third sketch.)
- Or is there an architecture that runs near-realtime fine-tuning of the model when the chat begins? Is that how those "speak with your PDF" apps work? (Last sketch.)
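
For the first question, here's what I imagine the chat application is doing: replaying the entire conversation on each turn. A sketch assuming the OpenAI Python SDK; the model name is just an example:

```python
# Sketch of chat "memory" via the context window: the client keeps the
# history and resends all of it with every request.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = client.chat.completions.create(
        model="gpt-4o-mini",     # example model
        messages=history,        # the WHOLE conversation so far, every turn
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("tell me a story"))
print(ask("make it shorter"))  # only works because the story is still in `history`
```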
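
And if that's right, the token count does grow every turn, so I'd guess apps trim or summarize old turns as they approach the model's limit. Something like this, counting tokens with tiktoken (the budget number is made up for illustration):

```python
# Sketch of one mitigation for the growing context: drop the oldest
# turns once the history exceeds a token budget.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by gpt-4o-family models
TOKEN_BUDGET = 4000                        # hypothetical, well under the model max

def trim(history: list[dict]) -> list[dict]:
    def count(msgs):
        return sum(len(enc.encode(m["content"])) for m in msgs)
    while count(history) > TOKEN_BUDGET and len(history) > 2:
        del history[1]  # keep the system prompt at index 0, drop the oldest turn
    return history
```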
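
On the database question, here's the retrieval pattern as I understand it: embed the documents once, embed the question at query time, fetch the nearest chunks by similarity, and paste them into the prompt. A real app would presumably use an actual vector store (pgvector, FAISS, Pinecone, ...); a plain list and cosine similarity show the idea. Assumes the OpenAI SDK for the embeddings:

```python
# Sketch of retrieval over a toy in-memory "vector database".
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

docs = ["Billy was born in 2017.", "The API rate limit is 100 req/min."]
index = [(doc, embed(doc)) for doc in docs]  # the "vector database"

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    def cosine(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(index, key=lambda pair: cosine(pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

context = "\n".join(retrieve("How old is Billy?"))
prompt = f"Answer using this context:\n{context}\n\nQuestion: How old is Billy?"
```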
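
And if the above is right, my guess is the "speak with your PDF" apps do retrieval rather than any fine-tuning: extract the text at upload time, split it into chunks, and embed the chunks into the same kind of index. A sketch assuming the pypdf library, with an arbitrary chunk size; is that roughly it?

```python
# Sketch of preparing a PDF for retrieval: extract, chunk, embed.
from pypdf import PdfReader

def pdf_chunks(path: str, chunk_chars: int = 1000) -> list[str]:
    text = "".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

# Then, reusing embed() from the previous sketch:
# index = [(chunk, embed(chunk)) for chunk in pdf_chunks("manual.pdf")]
```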
Feel free to be technical - I'm a software engineer, but a noob at the AI stuff.