I. The Memory Wall & VRAM Bottlenecks
In the current era of LLMs, the primary constraint is not compute power (FLOPS), but Memory Bandwidth. Training and serving models with billions of parameters requires moving massive amounts of data from High Bandwidth Memory (HBM) to the GPU cores.
We define the "Memory Wall" as the point where the speed of data transfer becomes the limiting factor in AI performance. To overcome this, we must employ compression techniques that reduce the model's footprint without severely degrading its reasoning capabilities.
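As a back-of-the-envelope illustration of why this matters, the sketch below estimates the weight footprint and the bandwidth-bound decoding rate for a single-stream LLM. The parameter count and HBM bandwidth figures are illustrative assumptions, not measurements of any specific GPU or model:

```python
# Rough estimate of how memory bandwidth bounds single-stream decoding speed.
# All figures below are illustrative assumptions, not measured values.

PARAMS = 70e9            # assumed model size: 70B parameters
BYTES_PER_PARAM = {      # footprint per parameter at different precisions
    "fp16": 2,
    "int8": 1,
    "int4": 0.5,
}
HBM_BANDWIDTH = 3.0e12   # assumed HBM bandwidth: ~3 TB/s

for dtype, nbytes in BYTES_PER_PARAM.items():
    footprint_gb = PARAMS * nbytes / 1e9
    # Single-stream autoregressive decoding streams every weight once per
    # generated token, so tokens/s is capped at bandwidth / weight footprint.
    max_tokens_per_s = HBM_BANDWIDTH / (PARAMS * nbytes)
    print(f"{dtype}: {footprint_gb:.0f} GB of weights, "
          f"<= {max_tokens_per_s:.0f} tokens/s (bandwidth bound)")
```

Note how halving the bytes per parameter doubles the bandwidth-bound token rate; this is the arithmetic motivation behind the compression techniques discussed here.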
Training is a search problem in a non-convex space. We move from vanilla Stochastic Gradient Descent (SGD) to adaptive methods that account for the **curvature** and **momentum** of the loss surface, as sketched below.
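For concreteness, here is a minimal NumPy sketch of an Adam-style update, one common adaptive method. The toy quadratic loss and the hyperparameter values (the usual published defaults) are illustrative assumptions, not anything prescribed by this text:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum via a running mean of gradients, plus
    per-parameter scaling by a running mean of squared gradients
    (a crude diagonal preconditioner standing in for curvature)."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad**2       # EMA of squared gradients
    m_hat = m / (1 - beta1**t)                  # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on a toy quadratic loss L(theta) = 0.5 * theta^2, whose gradient is theta.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    theta, m, v = adam_step(theta, grad=theta.copy(), m=m, v=v, t=t, lr=0.1)
print(theta)  # approaches 0
```

The square-root scaling shrinks steps along directions with consistently large gradients and enlarges them along flat directions, which is what "accounting for curvature" means in practice for this family of optimizers.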
Hessian Curvature
Second-order derivatives determine whether the loss landscape around the current parameters is a narrow ravine or a flat valley.
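The standard way to make this precise is the second-order Taylor expansion of the loss around the current parameters (a textbook identity, restated here for reference):

$$
L(\theta + \Delta\theta) \approx L(\theta) + \nabla L(\theta)^{\top} \Delta\theta
  + \tfrac{1}{2}\,\Delta\theta^{\top} H\, \Delta\theta,
\qquad H = \nabla^{2} L(\theta).
$$

Directions along which $H$ has large eigenvalues are steep and narrow (the ravine); directions with eigenvalues near zero are flat (the valley floor).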