I. Connectionism & Layered Representation
Deep learning is the modern implementation of Connectionism. As established in the work of Goodfellow, Bengio, and Courville (2016), intelligence in these systems is not found in a single node, but in the collective interaction of thousands of parameterized units.
We begin by defining the architecture: a sequence of non-linear transformations that iteratively distill raw input into high-level abstract features. This "Deep" structure allows the model to handle the complexity and variance of real-world data, such as images and natural language.
Functional Hierarchy
Earlier layers detect primitives (edges, textures); deeper layers synthesize abstract concepts (objects, semantics).
II. The Biological Neuron vs. The Artificial Perceptron
While early researchers like Rosenblatt (1958) were inspired by the human brain, modern artificial neurons follow a simplified mathematical abstraction. A neuron in a deep network is an Accumulator (a weighted sum of its inputs) followed by a Gate (a non-linear activation).
The strength of the "synapse" is represented by the Weight, while the threshold of triggering is controlled by the Bias. When we stack these units, we gain the ability to represent increasingly complex manifolds, eventually reaching the limits defined by the Universal Approximation Theorem.
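The accumulator-and-gate picture above can be sketched in a few lines of numpy; the sigmoid gate and the specific numbers below are illustrative choices, not prescriptive:

```python
import numpy as np

def neuron(x, w, b):
    """A single artificial neuron: accumulate, then gate.

    w plays the role of synaptic strengths (Weights); b shifts the
    firing threshold (Bias).
    """
    z = np.dot(w, x) + b             # accumulator: weighted sum of inputs
    return 1.0 / (1.0 + np.exp(-z))  # gate: sigmoid activation

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.8, 0.2, -0.5])   # synaptic weights
b = 0.1                          # bias
out = neuron(x, w, b)            # z = -0.7, sigmoid(-0.7) ≈ 0.332
```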
III. The Universal Approximation Theorem
Cybenko (1989) proved that an MLP with a single hidden layer and finite neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$.
Theoretical Infinity
While a single layer *can* represent any function, deep networks are more *efficient* (parameter-wise) and have superior structural priors for natural data (e.g., images/text).
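As a rough illustration of the theorem (not Cybenko's construction): a single hidden layer of randomly placed sigmoid units, with only the output weights fitted by least squares, already approximates a smooth target on a compact interval. The width, weight scale, and target function below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 400)[:, None]   # compact interval [-3, 3]
target = np.cos(2 * x).ravel()         # continuous function to approximate

hidden = 200                           # finite number of hidden units
W = rng.standard_normal((1, hidden)) * 3   # random input weights
b = rng.standard_normal(hidden) * 3        # random biases
phi = 1.0 / (1.0 + np.exp(-(x @ W + b)))   # sigmoid hidden activations

# Fit only the linear readout; the hidden layer stays random.
coef, *_ = np.linalg.lstsq(phi, target, rcond=None)
error = np.max(np.abs(phi @ coef - target))
```

Even this crude random-features scheme drives the worst-case error down; the theorem says enough well-chosen units can drive it arbitrarily low.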
IV. Weight Initialization Theory
Proper initialization is not magic; it is about preserving the variance of activations across layers. If variance drops (vanishes), the signal dies. If it grows (explodes), activations saturate. We aim for $\text{Var}(y) \approx \text{Var}(x)$.
Assuming linear activation and zero-mean, independent weights and inputs, the variance of a neuron's output $y = \sum_{i=1}^{n_{in}} w_i x_i$ is: $$ \text{Var}(y) = n_{in} \, \text{Var}(w) \, \text{Var}(x) $$
Variance Propagation
Signal variance evolves multiplicatively through a 10-layer network; the goal is a layer-to-layer ratio near 1.0.
To keep the variance constant across layers ($\text{Var}(y) = \text{Var}(x)$), we need $n_{in} \, \text{Var}(w) = 1$, i.e. $\text{Var}(w) = \frac{1}{n_{in}}$.
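A quick numpy check of this rule, under the same linear-activation assumption as the derivation: drawing weights with $\text{Var}(w) = 1/n_{in}$ keeps the signal variance roughly constant across layers. The layer width and depth here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_layers = 512, 10
x = rng.standard_normal((2000, n_in))   # input with Var(x) ≈ 1

h = x
for _ in range(n_layers):
    # Var(w) = 1/n_in: scale standard-normal weights by 1/sqrt(n_in)
    W = rng.standard_normal((n_in, n_in)) / np.sqrt(n_in)
    h = h @ W                           # linear layer (no activation)

# Var(h) stays near Var(x) after 10 layers instead of vanishing/exploding
```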
Xavier (Glorot)
Optimized for Sigmoid/Tanh (symmetric). Considers both fan-in and fan-out.
$$ \text{Var}(W) = \frac{2}{n_{in} + n_{out}} $$

V. Deriving Backpropagation
**Backpropagation** is the specific application of the chain rule to the parameters of a neural network. Formalized by Rumelhart, Hinton, and Williams (1986), it allows us to compute the "blame" each weight bears for the final error in a single backward pass.
By flowing the gradient from the output layer back toward the input, we can efficiently update billions of parameters. However, this process is susceptible to the Vanishing Gradient Problem.
Consider the derivative of the Sigmoid function \( \sigma(z) \): $$ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $$ Its maximum value is 0.25 (at \( z=0 \)). In a deep network, gradients are computed via the chain rule, multiplying one such derivative per layer: $$ \frac{\partial L}{\partial W_1} \propto \sigma'(z_1) \cdot \sigma'(z_2) \cdots \sigma'(z_L) $$ With 10 layers, the gradient is scaled by at most \( 0.25^{10} \approx 9.5 \times 10^{-7} \). The signal effectively vanishes, and early layers stop learning. This motivates ReLU (whose gradient is exactly 1 for positive inputs) and Residual Connections (a gradient highway).
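The arithmetic above can be verified directly; this sketch assumes the best case, where every pre-activation sits at \( z = 0 \):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Best case for sigmoid: z = 0, where sigma'(z) peaks at exactly 0.25.
layers = 10
gradient_scale = sigmoid_prime(0.0) ** layers   # 0.25**10 ≈ 9.5e-7
```

Any pre-activation away from zero makes the factor smaller than 0.25, so real networks fare even worse than this bound.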
Gradient Flow Analyzer
As layers are added, the gradient reaching the input layer becomes vanishingly small.
VI. Second-Order Optimization
Standard SGD uses only the Gradient (first derivative, slope). Second-order methods also use the Hessian Matrix (second derivative, curvature): a Newton step jumps directly to the minimum of the local quadratic approximation, rather than following the slope blindly.
The Saddle Point Problem: In high dimensions, local minima are rare. Most critical points (\(\nabla J = 0\)) are **Saddle Points** (minima in some directions, maxima in others). Looking at the Hessian's eigenvalues allows us to distinguish these.
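A minimal sketch of the eigenvalue test, using the textbook saddle \( J(x, y) = x^2 - y^2 \) (the function choice is ours, for illustration):

```python
import numpy as np

# Hessian of J(x, y) = x^2 - y^2 at the critical point (0, 0):
# d2J/dx2 = 2, d2J/dy2 = -2, mixed partials are 0.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eigenvalues = np.linalg.eigvalsh(H)   # ascending order for symmetric H

if np.all(eigenvalues > 0):
    kind = "local minimum"            # curvature up in every direction
elif np.all(eigenvalues < 0):
    kind = "local maximum"            # curvature down in every direction
else:
    kind = "saddle point"             # mixed signs: min in some directions, max in others
```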
VII. Normalization Taxonomy
Internal Covariate Shift makes training deep networks unstable. We normalize activations \(x\) using: $$ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} $$
BatchNorm (BN)
Normalizes across the Batch dimension. Fast, but degrades with small batches (e.g., in object detection).
LayerNorm (LN)
Normalizes across the Feature dimension. Independent of batch size. Standard for Transformers/RNNs.
GroupNorm (GN)
Splits features into groups. A compromise between BN and LN, commonly used in CNNs.
InstanceNorm (IN)
Normalizes per channel per image. Used in Style Transfer to remove content contrast info.
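The four schemes differ only in which axes the statistics \( \mu \) and \( \sigma^2 \) are computed over. A numpy sketch on a toy \( (N, C, H, W) \) tensor (the learnable scale/shift parameters are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy activations: (batch N=8, channels C=6, height H=4, width W=4)
x = rng.standard_normal((8, 6, 4, 4)) * 3 + 5
eps = 1e-5

def normalize(t, axes):
    mu = t.mean(axis=axes, keepdims=True)
    var = t.var(axis=axes, keepdims=True)
    return (t - mu) / np.sqrt(var + eps)

bn = normalize(x, (0, 2, 3))   # BatchNorm: stats per channel, over N, H, W
ln = normalize(x, (1, 2, 3))   # LayerNorm: stats per sample, over C, H, W
inorm = normalize(x, (2, 3))   # InstanceNorm: stats per sample, per channel

# GroupNorm: split C into groups, normalize within each group per sample
g = 2
gn = normalize(x.reshape(8, g, 6 // g, 4, 4), (2, 3, 4)).reshape(x.shape)
```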
VIII. The Residual Pathology
Balduzzi et al. (2017) showed that in standard deep networks, gradients behave like white noise ("Shattered Gradients"): the correlation between gradients at different layers decays exponentially with depth.
ResNets ($y = x + F(x)$) act as an Identity Highway. During backprop, the gradient distributes across both paths: $$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(1 + \frac{\partial F}{\partial x}\right) $$
This "$+1$" term ensures that gradient flow never fully dies, allowing training of 1000+ layer networks.
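The "$+1$" term can be checked numerically on a scalar residual block (the branch $F$ below is an arbitrary stand-in):

```python
import numpy as np

def F(x):
    return np.tanh(0.01 * x)   # a weak residual branch: F'(x) is small

def residual(x):
    return x + F(x)            # identity path plus residual branch

# Central-difference derivative of the whole block vs. the branch alone
x0, h = 1.5, 1e-6
dy_dx = (residual(x0 + h) - residual(x0 - h)) / (2 * h)
dF_dx = (F(x0 + h) - F(x0 - h)) / (2 * h)

# dy_dx equals 1 + dF_dx: even when F'(x) shrinks toward zero,
# the identity path guarantees the gradient never falls below ~1.
```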
IX. Generalization & Regularization
References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning (Ch. 6: Deep Feedforward Networks). MIT Press.
- Stanford CS231n: Convolutional Neural Networks for Visual Recognition (Karpathy et al.).
- Nielsen, M. (2015). Neural Networks and Deep Learning.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature.
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems.
- LeCun, Y., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation.