AI Cybersecurity & Robustness

Adversarial Robustness & Red Teaming

Secure AI systems against gradient-based attacks and prompt injection. Analyze the geometry of high-dimensional adversarial examples (FGSM, GCG) and implement certified defenses like Randomized Smoothing.

I. High-Dimensional Fragility

Deep learning models carve decision boundaries through a very high-dimensional pixel (or token) space, while the data itself occupies only a thin manifold embedded in it. Goodfellow et al. (2014) proposed the Linearity Hypothesis: adversarial examples exist not because deep networks are non-linear, but because they are too linear.

In a 1000-dimensional space, a perturbation bounded by \( \epsilon \) per coordinate and aligned with the weight vector adds up to a massive change in the activation: \( w^T(x + \eta) = w^Tx + w^T\eta \). Choosing \( \eta = \epsilon \, \mathrm{sign}(w) \) gives \( w^T\eta = \epsilon \|w\|_1 \), which grows linearly with the dimension \( n \) even though no single coordinate moves by more than \( \epsilon \).
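
The growth is easy to check numerically. Below is a minimal sketch (a toy calculation with a random weight vector, not tied to any real model): applying the worst-case \( L_\infty \) perturbation \( \eta = \epsilon\,\mathrm{sign}(w) \) yields \( w^T\eta = \epsilon \|w\|_1 \), which scales with \( n \).

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.01  # per-coordinate perturbation budget

for n in [10, 100, 1_000, 10_000]:
    w = rng.standard_normal(n)       # hypothetical weight vector
    eta = epsilon * np.sign(w)       # worst-case L-infinity perturbation
    delta = w @ eta                  # equals epsilon * ||w||_1
    print(f"n={n:>6}   w^T eta = {delta:.2f}")
```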

Concentration of Measure

Adversarial examples are not "bugs"; they are an inevitable statistical consequence of concentration of measure on high-dimensional spheres. Near any point on the data manifold, there exists another point very close by (in \( L_p \) distance) that lies on the other side of the decision boundary.
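
The same point can be illustrated with a toy experiment. The sketch below uses a random linear boundary as a hypothetical stand-in for a trained classifier's decision surface: for points on the unit sphere in \( \mathbb{R}^n \), the average \( L_2 \) distance to the boundary shrinks like \( 1/\sqrt{n} \), so an ever-smaller step suffices to cross it.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

for n in [10, 100, 1_000, 10_000]:
    dists = []
    for _ in range(1_000):
        w = unit(rng.standard_normal(n))   # random linear boundary w^T x = 0
        x = unit(rng.standard_normal(n))   # random point on the unit sphere
        dists.append(abs(w @ x))           # L2 distance from x to the boundary
    print(f"n={n:>6}   mean distance = {np.mean(dists):.4f}"
          f"   1/sqrt(n) = {1 / np.sqrt(n):.4f}")
```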

White-Box Attacks

The attacker has full access to the model weights \( \theta \) and input gradients \( \nabla_x \mathcal{L} \) (e.g., FGSM, PGD).
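
A minimal PyTorch sketch of both attacks, assuming a generic image classifier `model` with inputs in \( [0, 1] \) and a cross-entropy objective; the PGD step size and iteration count are illustrative defaults rather than the papers' exact settings.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    """One-step Fast Gradient Sign Method (Goodfellow et al., 2014)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Move every coordinate by epsilon in the direction that increases the loss.
    return (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, epsilon, alpha=0.01, steps=40):
    """Projected Gradient Descent (Madry et al., 2018): iterated FGSM steps
    projected back onto the L-infinity ball of radius epsilon."""
    x_orig = x.clone().detach()
    # Random start inside the epsilon-ball.
    x_adv = (x_orig + torch.empty_like(x_orig).uniform_(-epsilon, epsilon)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            # Project onto the epsilon-ball around the clean input, then the valid range.
            x_adv = x_orig + (x_adv - x_orig).clamp(-epsilon, epsilon)
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```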

Primary Sources & Further Reading

Adversarial Foundations
  • Goodfellow et al. (2014). Explaining and Harnessing Adversarial Examples.
  • Madry et al. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks.
  • Cohen et al. (2019). Certified Adversarial Robustness via Randomized Smoothing.
LLM Security
  • Zou et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.
  • Wei et al. (2023). Jailbroken: How Does LLM Safety Training Fail?
  • NIST (2024). Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations.