I'm trying to understand how GPTs/LLMs work, on a conceptual level and using the correct terminology.
Here's my understanding so far (please correct if I'm wrong):
- GPTs are pre-trained so that, for any given input, they produce the statistically most likely output based on their training.
- They do this token by token, without "understanding" the output, just having learned that this token is often followed by that other token (I've sketched my mental model of this after the list).
- It gains this knowledge during training, when the LLM is fed a large amount of text as embeddings (i.e. its "knowledge").
- An LLM can be fine-tuned after the pre-training stage, building on what it already learned to become more accurate in a particular domain. This happens by feeding it domain-specific labelled data and updating the model's parameters until it reaches the desired accuracy on the new data.
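
To check I've got the token-by-token part right, here's roughly what I picture happening under the hood. A minimal sketch assuming the Hugging Face transformers library and the small "gpt2" checkpoint; I gather real chat models sample from the predicted distribution rather than always taking the argmax:

```python
# Minimal sketch of token-by-token generation with greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokens = tokenizer("Once upon a time", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(tokens).logits      # a score for every vocabulary token
        next_id = logits[0, -1].argmax()   # greedy: take the single most likely one
        tokens = torch.cat([tokens, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(tokens[0]))
```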
Here's the bit I don't understand about "memory".
AFAIK, LLMs do not have long-term memory in the human sense (if I tell you today that I have a 6-year-old son, a year from now you'd know little Billy is 7).
So how are these models able to answer related follow-up questions in the chat?
e.g.
"tell me a story"
<some story>
"make it shorter"
<shortens the story>
- Is the application just passing the previous Q&A back in via the context window on every turn? (I've sketched my guess after this list.)
- Will the context window and the number of tokens required just keep growing the longer the conversation goes on? (Second sketch below.)
- Are there architectures where the model queries some database ("select * from user_history") before answering? Is that what vector databases are used for? (Third sketch.)
- Or is there an architecture that runs near-realtime fine-tuning of the model when the chat begins? Is that how those "speak with your PDF" apps work? (Last sketch.)
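
For the first question, here's what I imagine the chat application is doing: replaying the entire conversation on each turn. A sketch assuming the OpenAI Python SDK; the model name is just an example:

```python
# Sketch of chat "memory" via the context window: the client keeps the
# history and resends all of it with every request.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = client.chat.completions.create(
        model="gpt-4o-mini",     # example model
        messages=history,        # the WHOLE conversation so far, every turn
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("tell me a story"))
print(ask("make it shorter"))  # only works because the story is still in `history`
```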
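
And if that's right, the token count does grow every turn, so I'd guess apps trim or summarize old turns as they approach the model's limit. Something like this, counting tokens with tiktoken (the budget number is made up for illustration):

```python
# Sketch of one mitigation for the growing context: drop the oldest
# turns once the history exceeds a token budget.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by gpt-4o-family models
TOKEN_BUDGET = 4000                        # hypothetical, well under the model max

def trim(history: list[dict]) -> list[dict]:
    def count(msgs):
        return sum(len(enc.encode(m["content"])) for m in msgs)
    while count(history) > TOKEN_BUDGET and len(history) > 2:
        del history[1]  # keep the system prompt at index 0, drop the oldest turn
    return history
```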
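
On the database question, here's the retrieval pattern as I understand it: embed the documents once, embed the question at query time, fetch the nearest chunks by similarity, and paste them into the prompt. A real app would presumably use an actual vector store (pgvector, FAISS, Pinecone, ...); a plain list and cosine similarity show the idea. Assumes the OpenAI SDK for the embeddings:

```python
# Sketch of retrieval over a toy in-memory "vector database".
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

docs = ["Billy was born in 2017.", "The API rate limit is 100 req/min."]
index = [(doc, embed(doc)) for doc in docs]  # the "vector database"

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    def cosine(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(index, key=lambda pair: cosine(pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

context = "\n".join(retrieve("How old is Billy?"))
prompt = f"Answer using this context:\n{context}\n\nQuestion: How old is Billy?"
```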
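
And if the above is right, my guess is the "speak with your PDF" apps do retrieval rather than any fine-tuning: extract the text at upload time, split it into chunks, and embed the chunks into the same kind of index. A sketch assuming the pypdf library, with an arbitrary chunk size; is that roughly it?

```python
# Sketch of preparing a PDF for retrieval: extract, chunk, embed.
from pypdf import PdfReader

def pdf_chunks(path: str, chunk_chars: int = 1000) -> list[str]:
    text = "".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

# Then, reusing embed() from the previous sketch:
# index = [(chunk, embed(chunk)) for chunk in pdf_chunks("manual.pdf")]
```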
Feel free to be technical - I'm a software engineer, but a noob at the AI stuff.