Model Alignment & Adaptation

Fine-Tuning & Model Adaptation

Adapt pre-trained models to specific domains using Post-Training (PT) techniques. Explore the mathematics of PEFT (LoRA), the optimization landscape of RLHF (PPO), and the emergence of direct preference learning algorithms (DPO).

I. The Alignment Tax & HHH

Pre-training on corpora like the Pile or Common Crawl optimizes the next-token objective \( \max_\theta \log P_\theta(x_{\text{next}} \mid x_{\text{context}}) \). This produces a "Base Model" capable of fluent, low-perplexity completion but prone to toxicity, hallucination, and instruction disobedience. Alignment is the process of steering this probability distribution towards the HHH criteria: Helpful, Honest, and Harmless (Askell et al., 2021).
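To make the objective concrete, below is a minimal sketch (in PyTorch) of the causal language-modeling loss that pre-training minimizes: the negative log-likelihood of each token given its preceding context. The tiny embedding-plus-linear "model" is a hypothetical stand-in for a real transformer.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: a tiny causal LM mapping token ids to next-token logits.
vocab_size, seq_len, batch = 100, 16, 4
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)  # stand-in for a real transformer

tokens = torch.randint(0, vocab_size, (batch, seq_len))

logits = model(tokens)                       # (batch, seq_len, vocab)
# Shift: position t predicts token t+1, i.e. maximize log P(x_next | x_context).
pred, target = logits[:, :-1], tokens[:, 1:]
loss = F.cross_entropy(pred.reshape(-1, vocab_size), target.reshape(-1))
loss.backward()                              # minimize negative log-likelihood
```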

Accomplishing this without destroying the model's reasoning capabilities (catastrophic forgetting) is the central challenge. The Alignment Tax refers to the empirical observation that aggressive safety filtering often degrades performance on neutral benchmarks (e.g., lower code generation accuracy).

II. Supervised Fine-Tuning (SFT)

The first step is behavior cloning. We fine-tune the base model on a high-quality dataset of (Instruction, Response) pairs, minimizing the negative log-likelihood of the response tokens: $$ \mathcal{L}_{\text{SFT}} = -\sum_{t} \log P_\theta(y_t \mid x, y_{<t}) $$ This aligns the output format but fails to capture the nuance of human preference, especially when multiple valid answers exist (see the sketch below).
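A minimal sketch of the SFT loss, assuming concatenated (instruction, response) token ids and per-example prompt lengths; prompt positions are masked with the label -100, which PyTorch's cross_entropy ignores by default, so only response tokens contribute to the loss.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_lens):
    """Negative log-likelihood over response tokens only (prompt is masked).

    input_ids:   (batch, seq_len) concatenated prompt + response tokens
    prompt_lens: (batch,) number of prompt tokens per example
    """
    logits = model(input_ids)                    # (batch, seq_len, vocab)

    # Shift so position t predicts token t+1.
    pred   = logits[:, :-1, :]
    labels = input_ids[:, 1:].clone()

    # Mask out prompt positions: only score y_t given (x, y_<t).
    positions = torch.arange(labels.size(1), device=labels.device)
    prompt_mask = positions.unsqueeze(0) < (prompt_lens.unsqueeze(1) - 1)
    labels[prompt_mask] = -100                   # ignored by cross_entropy

    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```

In practice the same idea appears as "prompt masking" or "loss on completions only" in SFT libraries; masking matters because averaging the loss over prompt tokens would waste capacity re-learning the instruction distribution rather than the desired responses.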

The Imitation Gap

SFT models often hallucinate because they are forced to mimic expert demonstrations even when they lack the underlying knowledge.

Primary Sources & Further Reading

Alignment Foundations
  • Ouyang et al. (InstructGPT, 2022). Training language models to follow instructions with human feedback.
  • Rafailov et al. (DPO, 2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
  • Christiano et al. (2017). Deep Reinforcement Learning from Human Preferences.
PEFT & Safety
  • Hu et al. (LoRA, 2021). LoRA: Low-Rank Adaptation of Large Language Models.
  • Bai et al. (Anthropic, 2022). Constitutional AI: Harmlessness from AI Feedback.
  • Dettmers et al. (QLoRA, 2023). QLoRA: Efficient Finetuning of Quantized LLMs.