I. The "Attention is All You Need" Paradigm
Before 2017, sequence modeling relied on recurrent architectures, Recurrent Neural Networks (RNNs) and their LSTM variants, which process data one token at a time. The Transformer, introduced in the seminal paper by Vaswani et al. (2017), moved to a purely attention-based architecture, allowing massive parallelization and the modeling of long-range dependencies without the "forgetting" problem of recurrent cells.
The core insight is that the "recurrence" bottleneck could be replaced by a mechanism that allows every token to directly "attend" to every other token in a sequence, regardless of distance. This shift is what enabled the subsequent scaling of models to hundreds of billions, and even trillions, of parameters.
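As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention, the mechanism this insight rests on. The function name, toy dimensions, and shapes are illustrative choices rather than the paper's reference implementation; the point is that the score matrix `Q @ K.T` compares every token with every other token in a single operation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k). Every query row is compared against every key row,
    # so each token can draw information from any position, near or far.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over the key dimension
    return weights @ V                              # each output mixes all value vectors

# Toy self-attention: 4 tokens with 8-dimensional embeddings (Q = K = V = x)
x = np.random.default_rng(0).standard_normal((4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```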
Parallel Dispatch
Every token in a sequence attends to every other token simultaneously, replacing the RNN's chain of N sequential steps with pairwise computation that parallelizes across the entire sequence.
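To make that contrast concrete, the toy sketch below (with made-up weight matrices `W_h` and `W_x`) runs an RNN-style recurrence, which must advance one token at a time, next to the attention score computation, which is a single matrix product with no step-to-step dependency.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 6, 8
x = rng.standard_normal((seq_len, d))          # toy token embeddings

# RNN-style recurrence: step t cannot start before step t-1 finishes,
# so the N steps form a sequential chain regardless of available hardware.
W_h = rng.standard_normal((d, d)) * 0.1        # hypothetical recurrent weights
W_x = rng.standard_normal((d, d)) * 0.1        # hypothetical input weights
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Attention-style interaction: all pairwise token scores come out of one
# matrix product, trading the sequential chain for O(N^2) work that
# parallelizes trivially across the sequence.
scores = x @ x.T / np.sqrt(d)                  # (seq_len, seq_len)
print(h.shape, scores.shape)                   # (8,) (6, 6)
```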
Primary Sources & Further Reading
- Vaswani et al. (2017). Attention Is All You Need. (The foundation).
- Alammar, J. The Illustrated Transformer. (Visual guide).
- Karpathy, A. Let's build GPT: from scratch, in code, spelled out.
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
- Kaplan et al. (2020). Scaling Laws for Neural Language Models.
- Shazeer, N. (2020). GLU Variants Improve Transformer.