I. High-Dimensional Fragility
Deep learning models carve decision boundaries through a high-dimensional input space of pixels (or tokens). Goodfellow et al. (2014) proposed the Linearity Hypothesis: adversarial examples exist not because deep networks are too non-linear, but because they are too linear.
In an $n$-dimensional space (say $n = 1000$), consider the perturbation $\eta = \epsilon \, \mathrm{sign}(w)$, whose per-coordinate magnitude never exceeds $\epsilon$. The activation changes by $w^T(x + \eta) - w^T x = w^T\eta = \epsilon \sum_i |w_i|$, a term that grows linearly with the dimension $n$: many imperceptibly small per-coordinate changes add up to a massive change in the activation.
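This linear growth is easy to check numerically. A minimal sketch (variable names are illustrative): for a linear unit with i.i.d. Gaussian weights, the worst-case activation shift $\epsilon \sum_i |w_i|$ scales with $n$ even though each coordinate moves by at most $\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.007  # per-coordinate budget; each input coordinate moves imperceptibly

dims = (10, 1_000, 100_000)
deltas = []
for n in dims:
    w = rng.normal(size=n)      # weight vector of one linear unit
    eta = eps * np.sign(w)      # worst-case perturbation, ||eta||_inf = eps
    # change in the pre-activation: w^T(x + eta) - w^T x = w^T eta = eps * sum(|w_i|)
    deltas.append(w @ eta)

for n, d in zip(dims, deltas):
    print(f"n={n:>7}  activation shift={d:.3f}")
```

Each 100x increase in dimension yields roughly a 100x larger activation shift, while the perceptual size of the perturbation (its $\ell_\infty$ norm) stays fixed at $\epsilon$.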
White-Box Attacks
The attacker has full access to the model weights $\theta$ and the input gradients $\nabla_x \mathcal{L}$ (e.g., FGSM, PGD).
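As a concrete white-box example, FGSM takes a single signed-gradient step, $x_{adv} = x + \epsilon \, \mathrm{sign}(\nabla_x \mathcal{L})$. A minimal sketch on a logistic-regression model, where the input gradient has a closed form (the function name and setup are illustrative, not a standard API):

```python
import numpy as np

def fgsm_linear(w, b, x, y, eps):
    """One FGSM step against a logistic model sigma(w.x + b).

    For this model the input gradient of the cross-entropy loss
    is (sigma(z) - y) * w, so the attack moves every coordinate
    by exactly eps in the direction that increases the loss.
    """
    z = w @ x + b
    p = 1.0 / (1.0 + np.exp(-z))     # predicted probability of class 1
    grad_x = (p - y) * w             # d loss / d x (cross-entropy)
    return x + eps * np.sign(grad_x)

# Illustrative setup: a point confidently classified as class 1 (w.x = 0.5)
w = np.full(100, 0.05)
x = np.full(100, 0.1)
x_adv = fgsm_linear(w, 0.0, x, y=1.0, eps=0.2)
print("clean logit:", w @ x, " adversarial logit:", w @ x_adv)
```

For a deep network the same step applies once autograd supplies $\nabla_x \mathcal{L}$; PGD iterates this step, projecting back into the $\epsilon$-ball after each update.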
Primary Sources & Further Reading
- Goodfellow et al. (2014). Explaining and Harnessing Adversarial Examples.
- Madry et al. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks.
- Cohen et al. (2019). Certified Adversarial Robustness via Randomized Smoothing.
- Zou et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.
- Wei et al. (2023). Jailbroken: How Does LLM Safety Training Fail?
- NIST (2024). Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2).