I. The "Attention is All You Need" Paradigm
Before 2017, sequence modeling relied on recurrent architectures, Recurrent Neural Networks (RNNs) and their LSTM variants, which process data one token at a time. The Transformer, introduced in the seminal paper by Vaswani et al. (2017), moved to a purely attention-based architecture, allowing massive parallelization and the modeling of long-range dependencies without the "forgetting" problem of recurrent cells.
The core insight is that the "recurrence" bottleneck could be replaced by a mechanism that allows every token to directly "attend" to every other token in a sequence, regardless of distance. This shift is what enabled the subsequent scaling of models to hundreds of billions, and even trillions, of parameters.
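As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention, the mechanism this insight rests on. The function name, toy dimensions, and shapes are illustrative choices rather than the paper's reference implementation; the point is that the score matrix `Q @ K.T` compares every token with every other token in a single operation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k). Every query row is compared against every key row,
    # so each token can draw information from any position, near or far.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over the key dimension
    return weights @ V                              # each output mixes all value vectors

# Toy self-attention: 4 tokens with 8-dimensional embeddings (Q = K = V = x)
x = np.random.default_rng(0).standard_normal((4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```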
Parallel Dispatch
Every token in a sequence attends to every other token simultaneously, replacing the RNN's chain of N sequential steps with pairwise computation that parallelizes across the entire sequence.
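To make that contrast concrete, the toy sketch below (with made-up weight matrices `W_h` and `W_x`) runs an RNN-style recurrence, which must advance one token at a time, next to the attention score computation, which is a single matrix product with no step-to-step dependency.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 6, 8
x = rng.standard_normal((seq_len, d))          # toy token embeddings

# RNN-style recurrence: step t cannot start before step t-1 finishes,
# so the N steps form a sequential chain regardless of available hardware.
W_h = rng.standard_normal((d, d)) * 0.1        # hypothetical recurrent weights
W_x = rng.standard_normal((d, d)) * 0.1        # hypothetical input weights
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Attention-style interaction: all pairwise token scores come out of one
# matrix product, trading the sequential chain for O(N^2) work that
# parallelizes trivially across the sequence.
scores = x @ x.T / np.sqrt(d)                  # (seq_len, seq_len)
print(h.shape, scores.shape)                   # (8,) (6, 6)
```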
Primary Sources & Further Reading
- Vaswani et al. (2017). Attention Is All You Need. (The foundation).
- Alammar, J. The Illustrated Transformer. (Visual guide).
- Karpathy, A. Let's build GPT: from scratch, in code, spelled out.
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
- Kaplan et al. (2020). Scaling Laws for Neural Language Models.
- Shazeer, N. (2020). GLU Variants Improve Transformer.