Yes, it's another Transformer architecture that aims to be cheaper and faster, but no, this is not the same. All of the developments come from equations and architectural changes, with no hardware or code tricks. Performance is very good in testing on very small models (as in the diagram), and also at sequence lengths of 100K+ on a single GPU with models in the tens of millions of parameters. Though no paper is currently available, a GitHub repository with full code, explanations, intuitions, and some results is available here. I'm the sole author, and depending on the feedback here I may go on to write a paper, though my resources are extremely limited.
I would very much appreciate any feedback on the work, code, or ideas, and I'd welcome anyone contacting me with questions or suggestions for next steps.
Repository here.