As AI models get bigger, training them requires more and more computing power. Researchers are looking for ways to train these large AI models without needing Google-scale resources.
A new paper proposes LongLoRA, a fine-tuning approach that can extend LLaMA2 7B to 100k context length and 70B model to 32k context length on a single 8× A100 machine.
Here are my highlights from the paper:
Big one of course: LongLoRA fine-tunes pretrained LLMs to much longer context lengths at a fraction of the usual cost
Key points:
- Approximates full attention during training via "shift short attention" (S²-Attn): tokens attend within local groups, and half the heads shift their groups so information flows across group boundaries (rough sketch after this list)
- Tunes only low-rank adapters (LoRA) plus the embedding and normalization layers, which the paper finds is what makes LoRA work at long context
- Fine-tuned a 7B model to 100k-token context on a single machine
- Way lower training cost than full fine-tuning for large contexts
- Close to full fine-tuning performance
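If you're curious how the shift trick works mechanically, here's a minimal PyTorch sketch of my reading of it (not the authors' code; the tensor shapes, the group_size value, and the omission of causal masking are all my simplifications):

```python
import torch
import torch.nn.functional as F

def shift_short_attention(q, k, v, group_size=2048):
    # q, k, v: (batch, heads, seq_len, head_dim); assumes seq_len % group_size == 0.
    B, H, N, D = q.shape
    half = H // 2

    # For half the heads, shift the sequence by half a group so neighboring
    # groups overlap; this is what lets group-local attention approximate
    # full attention during training.
    q, k, v = [
        torch.cat([t[:, :half],
                   torch.roll(t[:, half:], shifts=-group_size // 2, dims=2)], dim=1)
        for t in (q, k, v)
    ]

    # Split the sequence into groups and run ordinary attention inside each
    # group, so the cost is quadratic in group_size rather than seq_len.
    def group(t):
        return t.reshape(B, H, N // group_size, group_size, D)

    out = F.scaled_dot_product_attention(group(q), group(k), group(v))
    out = out.reshape(B, H, N, D)

    # Undo the shift for the second half of the heads.
    return torch.cat([out[:, :half],
                      torch.roll(out[:, half:], shifts=group_size // 2, dims=2)], dim=1)
```

At inference you'd just drop this and run the model with its normal full attention, which is why the trained model still behaves like a standard long-context LLaMA.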
The core insight is that a cheap approximation of full attention is good enough during training, while the model keeps standard full attention for inference. Combined with tuning only a small set of weights, this cuts the compute and memory needed by a lot.
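Roughly, the trainable-parameter recipe looks like this (my own sketch using Hugging Face's peft library, not the paper's released code; the module and parameter names assume a LLaMA-2-style checkpoint):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Standard LoRA on the attention projections; the base weights stay frozen.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# The extra step the paper argues matters for long context: also make the
# (tiny) embedding and normalization parameters trainable.
for name, param in model.named_parameters():
    if "embed" in name or "norm" in name:
        param.requires_grad = True
```

In practice you'd also want those extra modules included when saving the adapter (peft's modules_to_save covers that), but the point is that the added trainable parameters are a rounding error next to the 7B frozen weights.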
I think this demonstrates the potential to get more capable models without unreasonable resources. Efficient fine-tuning techniques = longer-context, more powerful LLMs on the same hardware.