I. The Alignment Tax & HHH
Pre-training on web-scale corpora such as the Pile or Common Crawl optimizes the next-token objective \( \max_\theta \log P_\theta(x_{\text{next}} \mid x_{\text{context}}) \). This produces a "Base Model" capable of fluent, low-perplexity completion but prone to toxicity, hallucination, and ignoring instructions. Alignment is the process of steering this probability distribution toward the HHH criteria: Helpful, Honest, and Harmless (Askell et al., 2021).
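To make the objective concrete, here is a minimal sketch of the token-level loss a causal language model minimizes during pre-training. PyTorch is assumed; the function name `next_token_loss` and the tensor shapes are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Causal LM objective: minimize cross-entropy of each token given its
    left context, i.e. maximize log P_theta(x_next | x_context)."""
    # logits: (batch, seq_len, vocab_size); tokens: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # predictions made at positions 0..T-2
    shift_targets = tokens[:, 1:]      # each position must predict the following token
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```

Minimizing this loss is equivalent to maximizing the log-likelihood above; nothing in it distinguishes helpful text from toxic text, which is why the base model needs alignment afterwards.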
Accomplishing this without destroying the model's reasoning capabilities (catastrophic forgetting) is the central challenge. The Alignment Tax refers to the empirical observation that aggressive safety filtering often degrades performance on neutral benchmarks (e.g., lower code generation accuracy).
The Imitation Gap
SFT models often hallucinate because teacher forcing penalizes any deviation from the expert demonstration: the model is rewarded for reproducing confident expert statements even when its own pre-trained knowledge does not support them, so it learns to assert rather than to abstain.
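A minimal sketch of the SFT loss makes the gap visible: it is plain teacher-forced cross-entropy on the expert's response tokens, so the gradient pushes the model toward the expert's claims whether or not its own parameters encode the supporting facts. PyTorch is assumed; `sft_loss` and the prompt/response masking convention are illustrative.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, tokens: torch.Tensor,
             response_mask: torch.Tensor) -> torch.Tensor:
    """Supervised fine-tuning: cross-entropy on demonstration tokens only.
    The objective rewards reproducing the expert's wording; it carries no
    signal about whether the model 'knows' the fact it is asserting."""
    shift_logits = logits[:, :-1, :]
    shift_targets = tokens[:, 1:]
    shift_mask = response_mask[:, 1:].float()  # 1 where the target is a response token
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
        reduction="none",
    )
    return (nll * shift_mask.reshape(-1)).sum() / shift_mask.sum()
```

This is one motivation often given for preference-based methods (RLHF, DPO): a reward or preference signal can penalize confident fabrication in a way the imitation loss cannot.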
Primary Sources & Further Reading
- Ouyang et al. (InstructGPT, 2022). Training language models to follow instructions with human feedback.
- Rafailov et al. (DPO, 2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
- Christiano et al. (2017). Deep Reinforcement Learning from Human Preferences.
- Hu et al. (LoRA, 2021). LoRA: Low-Rank Adaptation of Large Language Models.
- Bai et al. (Anthropic, 2022). Constitutional AI: Harmlessness from AI Feedback.
- Dettmers et al. (QLoRA, 2023). QLoRA: Efficient Finetuning of Quantized LLMs.