For whatever reason, I was thinking about two speed reading techniques, a study about linguistics and the philosophy of communication, and LLMs, as one's brain does while lying in bed...
Compressed text form (e.g. "Lil word need t'undest") and the word context cloud, where you isolate just the important and connective words, enough to reconstruct the meaning.
As I understand it, as a layperson with some minor programming knowledge and a TON of the 'tism, LLMs currently store all tokens of a conversation linearly as the conversation flows, attending to every previous token at every step.
This is mainly what drives up the cost of RAM and compute.
Or so I think.
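To make that cost concrete, here's a rough sketch (all parameter values are illustrative, not any real model's numbers) of why attending to every previous token drives up memory and compute: the key/value cache grows linearly with context length, and total attention work grows roughly quadratically.

```python
# Hedged sketch: illustrative numbers only, not a real model's config.

def kv_cache_bytes(n_tokens, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_value=2):
    # 2x for keys and values; every generated token adds one cache entry
    # per layer per head, so memory grows linearly with context length.
    return 2 * n_tokens * n_layers * n_heads * head_dim * bytes_per_value

def attention_steps(n_tokens):
    # Generating token i attends to all i previous tokens, so the total
    # work over a conversation is 1 + 2 + ... + n, i.e. quadratic.
    return n_tokens * (n_tokens + 1) // 2

# Doubling the context doubles cache memory but ~quadruples total work.
```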
I've both studied and taught English as a second language, and I know a bit about speed reading and memorization techniques, which I use to some degree in both languages I speak.
Not an expert, but curiosity definitely is a trap.
My theory involves using what is essentially a compressed-text word vector cloud: a cluster of shortened, merged tokens and vectors that reuses tokens to build context connections, in a way that lets the information of the whole text be reconstructed.
This would allow much more efficient use of RAM to store much denser clouds of information.
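A toy sketch of that compression idea (the word lists here are my own illustrations, not a real tokenizer's): drop high-frequency filler words that a reader or model could reconstruct, but keep content-bearing and connective words.

```python
# Hypothetical "compressed word cloud": illustrative word lists only.

FILLER = {"the", "a", "an", "of", "to", "is", "are", "that", "it",
          "very", "really", "just"}
CONNECTIVES = {"and", "but", "or", "because", "so", "if", "then", "not"}

def compress(text):
    # Keep content words and connectives; prune reconstructable filler.
    words = text.lower().split()
    kept = [w for w in words if w not in FILLER or w in CONNECTIVES]
    return " ".join(kept)

example = "the sword and the shield are stored in the armory"
# compress(example) keeps roughly: "sword and shield stored in armory"
```

The connective words survive compression because, as with the speed reading technique, they carry the relationships needed to rebuild the full meaning.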
Another important point: for each language, many tokens and nodes would probably be reusable across 99% of texts.
This is roughly how I, as a bilingual person who sometimes speed-reads with those two techniques, understand linguistics, memory and communication to work when stripped to their barebones.
Hierarchical clouds and pruning also seem efficient, effective, and roughly analogous to how human memory works, with short- and long-term memory activating only the neurons that are needed.
Clouds of tokens like "be, do, act, a, one" and other such extremely common information are used far more often than token nodes like "sword, spear, axe, shield" and the like.
Someone talking about a large hadron collider would not need a cloud for medieval weapons.
Splitting the node clouds into several hierarchical tiers, pruning whatever isn't used or needed for the current context, and reloading it later only if it becomes needed would severely reduce the number of tokens that have to stay loaded. That leaves a lot more space for vectors, which makes it even more efficient.
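The tiered idea above could be sketched like this (everything here, names and structure, is my own illustration of the post's proposal, not an existing LLM mechanism): a small hot tier of ultra-common tokens stays resident, topic clouds are pruned when unused, and pruned clouds are reloaded on demand from a cold store.

```python
# Illustrative sketch of multi-tier token clouds with prune/reload.

class TieredClouds:
    def __init__(self, cold_store):
        self.hot = {"be", "do", "act", "a", "one"}  # always resident
        self.loaded = {}                            # topic -> token set
        self.cold = cold_store                      # pruned-to storage

    def lookup(self, topic, token):
        if token in self.hot:                       # hot tier: no load
            return True
        if topic not in self.loaded:                # reload on demand
            self.loaded[topic] = self.cold.get(topic, set())
        return token in self.loaded[topic]

    def prune(self, topic):
        # Drop an unused topic cloud; it stays recoverable in cold storage.
        self.loaded.pop(topic, None)

store = {"medieval": {"sword", "spear", "axe", "shield"},
         "physics": {"hadron", "collider"}}
clouds = TieredClouds(store)
```

So a conversation about the Large Hadron Collider would load the "physics" cloud, never touch "medieval", and the hot tier serves the common glue words for free.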
Is this actually meaningful or useful?
I mean, you yourself probably skimmed the first few long paragraphs to get to the heart of the matter, no?