I. High-Dimensional Fragility
Deep learning models learn decision surfaces in an enormous pixel (or token) space, and the data they are trained on occupies only a thin manifold of it. Goodfellow et al. (2014) proposed the Linearity Hypothesis: adversarial examples exist not because deep networks are non-linear, but because they are too linear.
In an n-dimensional input space, even a perturbation of at most \( \epsilon \) per coordinate can produce a large change in a linear activation: \( w^T(x + \eta) = w^T x + w^T\eta \). Choosing the perturbation to align with the weight vector, \( \eta = \epsilon\,\mathrm{sign}(w) \), gives \( w^T\eta = \epsilon \| w \|_1 \), a term that grows linearly with the dimension \( n \); at \( n = 1000 \) or more, many imperceptible per-pixel changes add up to a massive shift in the activation.
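A quick numerical sketch of this scaling (illustrative only, with an arbitrary random weight vector): the per-coordinate perturbation stays fixed at \( \epsilon \), yet the induced change in the activation grows roughly linearly with \( n \).

```python
# Toy illustration (not from the paper): eta = eps * sign(w) keeps every
# coordinate change at eps, yet w^T eta = eps * ||w||_1 scales with n.
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.01  # imperceptible per-coordinate change

for n in (10, 100, 1_000, 10_000):
    w = rng.normal(size=n)          # hypothetical weight vector
    eta = epsilon * np.sign(w)      # worst-case L_inf-bounded perturbation
    print(f"n={n:6d}  w^T eta = {w @ eta:9.3f}")
```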
White-Box Attacks
The attacker has full access to the model weights \( \theta \) and can compute the input gradient \( \nabla_x \mathcal{L} \) directly (e.g., FGSM, PGD); see the sketch below.
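As a rough sketch (not the original authors' implementations), both attacks fit in a few lines of PyTorch; `model`, `x`, and `y` below are assumed placeholders for a differentiable classifier, an input batch scaled to [0, 1], and its integer labels.

```python
# Minimal white-box attack sketches; `model`, `x`, `y` are hypothetical
# placeholders, not names taken from the cited papers.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=8 / 255):
    """Single signed-gradient step of size epsilon (Goodfellow et al., 2014)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)        # L(theta, x, y)
    grad = torch.autograd.grad(loss, x)[0]     # nabla_x L
    return (x + epsilon * grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=10):
    """Iterated signed-gradient steps, each projected back into the
    L_inf epsilon-ball around the clean input (Madry et al., 2018)."""
    x0 = x.clone().detach()
    x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = (x0 + (x_adv - x0).clamp(-epsilon, epsilon)).clamp(0, 1)
    return x_adv.detach()
```

FGSM takes one step of size \( \epsilon \); PGD iterates smaller steps and projects back into the \( \epsilon \)-ball, which is why it is the stronger baseline. A random start inside the ball, used by Madry et al. (2018), is omitted here to keep the sketch short.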
Primary Sources & Further Reading
- Goodfellow et al. (2014). Explaining and Harnessing Adversarial Examples.
- Madry et al. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks.
- Cohen et al. (2019). Certified Adversarial Robustness via Randomized Smoothing.
- Zou et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.
- Wei et al. (2023). Jailbroken: How Does LLM Safety Training Fail?
- NIST (2024). Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations.