Inference Optimization

Model Optimization & Efficiency

Bridging the gap between research and production. This module focuses on making models faster, smaller, and cheaper through advanced quantization, knowledge distillation, and efficient training techniques like LoRA and QLoRA.

I. The Memory Wall & VRAM Bottlenecks

In the current era of LLMs, the primary constraint is not compute power (FLOPS), but Memory Bandwidth. Training and serving models with billions of parameters requires moving massive amounts of data from High Bandwidth Memory (HBM) to the GPU cores.

We define the "Memory Wall" as the point where the speed of data transfer becomes the limiting factor in AI performance. To overcome this, we must employ compression techniques that reduce the model's footprint without severely degrading its reasoning capabilities.
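To see why bandwidth, not compute, dominates, here is a back-of-the-envelope sketch. The hardware numbers are hypothetical (roughly A100-class) and the model size is illustrative; it estimates the time to generate one token at batch size 1, where every weight must be streamed from HBM once:

```python
# Back-of-the-envelope: why autoregressive decoding is memory-bound.
# All hardware figures below are hypothetical (roughly A100-class).
PARAMS = 7e9           # 7B-parameter model (illustrative)
BYTES_PER_PARAM = 2    # fp16
HBM_BW = 2.0e12        # ~2 TB/s HBM bandwidth
PEAK_FLOPS = 3.0e14    # ~300 TFLOP/s fp16 tensor-core peak

weight_bytes = PARAMS * BYTES_PER_PARAM
t_memory = weight_bytes / HBM_BW        # every weight read once per token
t_compute = (2 * PARAMS) / PEAK_FLOPS   # ~2 FLOPs per parameter per token

print(f"memory-bound time per token:  {t_memory * 1e3:.2f} ms")
print(f"compute-bound time per token: {t_compute * 1e3:.3f} ms")
print(f"bandwidth-limited ceiling:    {1 / t_memory:.0f} tokens/s")
```

Under these assumptions the memory transfer (~7 ms) dwarfs the arithmetic (~0.05 ms) by two orders of magnitude, which is exactly why halving the bytes per weight via quantization roughly doubles decoding throughput.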

Sharpness-Aware Minimization (SAM)

Modern optimization isn't just about finding the lowest loss value, but about finding the **flattest** minima. Foret et al. (2020) introduced SAM, which minimizes loss value and loss sharpness simultaneously. This improves generalization by ensuring the model stays in a "safe" region of the loss landscape even if its weights are slightly perturbed (e.g., by quantization).
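SAM's two-step update can be sketched on a toy quadratic loss. This is a minimal illustration, not the reference implementation: step 1 ascends to the worst-case point inside an L2 ball of radius `rho`, step 2 descends using the gradient measured there.

```python
import numpy as np

def loss(w):
    return np.sum(w ** 2)   # toy quadratic loss, standing in for training loss

def grad(w):
    return 2 * w            # analytic gradient of the toy loss

def sam_step(w, lr=0.1, rho=0.05):
    g = grad(w)
    # Step 1: move to the (approximately) sharpest point within radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Step 2: descend using the gradient at the perturbed weights.
    g_sharp = grad(w + eps)
    return w - lr * g_sharp

w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w)
print(loss(w))  # small: w has settled near the minimum at the origin
```

Because the descent direction is evaluated at the perturbed point, minima that are only good at a single exact weight setting (narrow ravines) stop being attractive, which is precisely the robustness property that helps quantized weights stay accurate.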

Hessian Curvature

Second-order derivatives determine whether the loss landscape around a minimum is a narrow ravine or a flat valley: the eigenvalues of the Hessian quantify curvature along each direction, with large eigenvalues indicating sharpness.
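This can be made concrete with a finite-difference probe of the Hessian on two illustrative toy losses (both functions and the probe helper are assumptions for the sketch, not part of any library):

```python
import numpy as np

def hessian(f, w, h=1e-4):
    """Central finite-difference Hessian of scalar f at point w."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.eye(n)[i] * h
            e_j = np.eye(n)[j] * h
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * h * h)
    return H

sharp = lambda w: 100 * w[0] ** 2 + w[1] ** 2      # narrow ravine
flat = lambda w: 0.01 * (w[0] ** 2 + w[1] ** 2)    # flat valley

w0 = np.zeros(2)
print(np.linalg.eigvalsh(hessian(sharp, w0)).max())  # large curvature (~200)
print(np.linalg.eigvalsh(hessian(flat, w0)).max())   # tiny curvature (~0.02)
```

The sharp minimum's top eigenvalue is four orders of magnitude larger: a small weight perturbation (such as quantization error) moves the loss far more in the ravine than in the valley, which is the intuition SAM exploits.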