I. High-Dimensional Fragility
Deep learning models learn decision surfaces in an enormous pixel (or token) space, and the data they are trained on occupies only a thin manifold of it. Goodfellow et al. (2014) proposed the Linearity Hypothesis: adversarial examples exist not because deep networks are non-linear, but because they are too linear.
In an n-dimensional input space, even a perturbation of at most \( \epsilon \) per coordinate can produce a large change in a linear activation: \( w^T(x + \eta) = w^T x + w^T\eta \). Choosing the perturbation to align with the weight vector, \( \eta = \epsilon\,\mathrm{sign}(w) \), gives \( w^T\eta = \epsilon \| w \|_1 \), a term that grows linearly with the dimension \( n \); at \( n = 1000 \) or more, many imperceptible per-pixel changes add up to a massive shift in the activation.
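A quick numerical sketch of this scaling (illustrative only, with an arbitrary random weight vector): the per-coordinate perturbation stays fixed at \( \epsilon \), yet the induced change in the activation grows roughly linearly with \( n \).

```python
# Toy illustration (not from the paper): eta = eps * sign(w) keeps every
# coordinate change at eps, yet w^T eta = eps * ||w||_1 scales with n.
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.01  # imperceptible per-coordinate change

for n in (10, 100, 1_000, 10_000):
    w = rng.normal(size=n)          # hypothetical weight vector
    eta = epsilon * np.sign(w)      # worst-case L_inf-bounded perturbation
    print(f"n={n:6d}  w^T eta = {w @ eta:9.3f}")
```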
White-Box Attacks
The attacker has full access to the model weights \( \theta \) and can compute the input gradient \( \nabla_x \mathcal{L} \) directly (e.g., FGSM, PGD); see the sketch below.
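As a rough sketch (not the original authors' implementations), both attacks fit in a few lines of PyTorch; `model`, `x`, and `y` below are assumed placeholders for a differentiable classifier, an input batch scaled to [0, 1], and its integer labels.

```python
# Minimal white-box attack sketches; `model`, `x`, `y` are hypothetical
# placeholders, not names taken from the cited papers.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=8 / 255):
    """Single signed-gradient step of size epsilon (Goodfellow et al., 2014)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)        # L(theta, x, y)
    grad = torch.autograd.grad(loss, x)[0]     # nabla_x L
    return (x + epsilon * grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=10):
    """Iterated signed-gradient steps, each projected back into the
    L_inf epsilon-ball around the clean input (Madry et al., 2018)."""
    x0 = x.clone().detach()
    x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = (x0 + (x_adv - x0).clamp(-epsilon, epsilon)).clamp(0, 1)
    return x_adv.detach()
```

FGSM takes one step of size \( \epsilon \); PGD iterates smaller steps and projects back into the \( \epsilon \)-ball, which is why it is the stronger baseline. A random start inside the ball, used by Madry et al. (2018), is omitted here to keep the sketch short.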
Primary Sources & Further Reading
- Goodfellow et al. (2014). Explaining and Harnessing Adversarial Examples.
- Madry et al. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks.
- Cohen et al. (2019). Certified Adversarial Robustness via Randomized Smoothing.
- Zou et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.
- Wei et al. (2023). Jailbroken: How Does LLM Safety Training Fail?
- NIST (2024). Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations.