Infinite context windows? Streaming LLMs can be extended to infinite sequence lengths without any fine-tuning.

LLMs like GPT-3 struggle in streaming uses like chatbots because their performance tanks on long texts exceeding their training length. I checked out a new paper investigating why windowed attention fails for this.

By visualizing the attention maps, the researchers noticed LLMs heavily attend initial tokens as "attention sinks" even if meaningless. This anchors the distribution.

They realized evicting these sink tokens causes the attention scores to get warped, destabilizing predictions.

Their proposed "StreamingLLM" method simply caches a few initial sink tokens plus recent ones. This tweaks LLMs to handle crazy long texts. Models tuned with StreamingLLM smoothly processed sequences with millions of tokens, and were up to 22x faster than other approaches.

Even cooler - adding a special "[Sink Token]" during pre-training further improved streaming ability. The model just used that single token as the anchor. I think the abstract says it best:

We introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.

TLDR: LLMs break on long convos. Researchers found they cling to initial tokens as attention sinks. Caching those tokens lets LLMs chat infinitely.

Full summary here

Paper link: https://arxiv.org/pdf/2309.17453.pdf

submitted by /u/Successful-Western27
[link] [comments]