Inference Optimization

Model Optimization & Efficiency

Bridging the gap between research and production. This module focuses on making models faster, smaller, and cheaper through advanced quantization, knowledge distillation, and efficient training techniques like LoRA and QLoRA.

I. The Memory Wall & VRAM Bottlenecks

In the current era of LLMs, the primary constraint is not compute power (FLOPS), but Memory Bandwidth. Training and serving models with billions of parameters requires moving massive amounts of data from High Bandwidth Memory (HBM) to the GPU cores.

We define the "Memory Wall" as the point where the speed of data transfer becomes the limiting factor in AI performance. To overcome this, we must employ compression techniques that reduce the model's footprint without severely degrading its reasoning capabilities.
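To see why bandwidth, not compute, dominates, here is a back-of-the-envelope sketch. The hardware numbers are hypothetical (roughly A100-class) and the model size is illustrative; it estimates the time to generate one token at batch size 1, where every weight must be streamed from HBM once:

```python
# Back-of-the-envelope: why autoregressive decoding is memory-bound.
# All hardware figures below are hypothetical (roughly A100-class).
PARAMS = 7e9           # 7B-parameter model (illustrative)
BYTES_PER_PARAM = 2    # fp16
HBM_BW = 2.0e12        # ~2 TB/s HBM bandwidth
PEAK_FLOPS = 3.0e14    # ~300 TFLOP/s fp16 tensor-core peak

weight_bytes = PARAMS * BYTES_PER_PARAM
t_memory = weight_bytes / HBM_BW        # every weight read once per token
t_compute = (2 * PARAMS) / PEAK_FLOPS   # ~2 FLOPs per parameter per token

print(f"memory-bound time per token:  {t_memory * 1e3:.2f} ms")
print(f"compute-bound time per token: {t_compute * 1e3:.3f} ms")
print(f"bandwidth-limited ceiling:    {1 / t_memory:.0f} tokens/s")
```

Under these assumptions the memory transfer (~7 ms) dwarfs the arithmetic (~0.05 ms) by two orders of magnitude, which is exactly why halving the bytes per weight via quantization roughly doubles decoding throughput.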

Sharpness-Aware Minimization (SAM)

Modern optimization isn't just about finding the lowest loss value, but about finding the **flattest** minima. Foret et al. (2020) introduced SAM, which minimizes loss value and loss sharpness simultaneously. This improves generalization by ensuring the model stays in a "safe" region of the loss landscape even if its weights are slightly perturbed (e.g., by quantization).
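SAM's two-step update can be sketched on a toy quadratic loss. This is a minimal illustration, not the reference implementation: step 1 ascends to the worst-case point inside an L2 ball of radius `rho`, step 2 descends using the gradient measured there.

```python
import numpy as np

def loss(w):
    return np.sum(w ** 2)   # toy quadratic loss, standing in for training loss

def grad(w):
    return 2 * w            # analytic gradient of the toy loss

def sam_step(w, lr=0.1, rho=0.05):
    g = grad(w)
    # Step 1: move to the (approximately) sharpest point within radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Step 2: descend using the gradient at the perturbed weights.
    g_sharp = grad(w + eps)
    return w - lr * g_sharp

w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w)
print(loss(w))  # small: w has settled near the minimum at the origin
```

Because the descent direction is evaluated at the perturbed point, minima that are only good at a single exact weight setting (narrow ravines) stop being attractive, which is precisely the robustness property that helps quantized weights stay accurate.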

Hessian Curvature

Second-order derivatives determine whether the loss landscape around a minimum is a narrow ravine or a flat valley: the eigenvalues of the Hessian quantify curvature along each direction, with large eigenvalues indicating sharpness.
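This can be made concrete with a finite-difference probe of the Hessian on two illustrative toy losses (both functions and the probe helper are assumptions for the sketch, not part of any library):

```python
import numpy as np

def hessian(f, w, h=1e-4):
    """Central finite-difference Hessian of scalar f at point w."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.eye(n)[i] * h
            e_j = np.eye(n)[j] * h
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * h * h)
    return H

sharp = lambda w: 100 * w[0] ** 2 + w[1] ** 2      # narrow ravine
flat = lambda w: 0.01 * (w[0] ** 2 + w[1] ** 2)    # flat valley

w0 = np.zeros(2)
print(np.linalg.eigvalsh(hessian(sharp, w0)).max())  # large curvature (~200)
print(np.linalg.eigvalsh(hessian(flat, w0)).max())   # tiny curvature (~0.02)
```

The sharp minimum's top eigenvalue is four orders of magnitude larger: a small weight perturbation (such as quantization error) moves the loss far more in the ravine than in the valley, which is the intuition SAM exploits.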