I. Connectionism & Layered Representation
Deep learning is the modern implementation of Connectionism. As established in the work of Goodfellow, Bengio, and Courville (2016), intelligence in these systems is not found in a single node, but in the collective interaction of thousands of parameterized units.
We begin by defining the architecture: a sequence of non-linear transformations that iteratively distill raw input into high-level abstract features. This "Deep" structure allows the model to handle the complexity and variance of real-world data, such as images and natural language.
Functional Hierarchy
Earlier layers detect primitives (edges, textures); deeper layers synthesize abstract concepts (objects, semantics).
II. The Biological Neuron vs. The Artificial Perceptron
While early researchers like Rosenblatt (1958) were inspired by the human brain, modern artificial neurons follow a simplified mathematical abstraction. A neuron in a deep network is an Accumulator (a weighted sum of its inputs) followed by a Gate (a non-linear activation).
The strength of the "synapse" is represented by the Weight, while the threshold of triggering is controlled by the Bias. When we stack these units, we gain the ability to represent increasingly complex manifolds, eventually reaching the limits defined by the Universal Approximation Theorem.
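The accumulator-and-gate picture above can be sketched in a few lines of numpy; the sigmoid gate and the specific numbers below are illustrative choices, not prescriptive:

```python
import numpy as np

def neuron(x, w, b):
    """A single artificial neuron: accumulate, then gate.

    w plays the role of synaptic strengths (Weights); b shifts the
    firing threshold (Bias).
    """
    z = np.dot(w, x) + b             # accumulator: weighted sum of inputs
    return 1.0 / (1.0 + np.exp(-z))  # gate: sigmoid activation

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.8, 0.2, -0.5])   # synaptic weights
b = 0.1                          # bias
out = neuron(x, w, b)            # z = -0.7, sigmoid(-0.7) ≈ 0.332
```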
III. The Universal Approximation Theorem
Cybenko (1989) proved that an MLP with a single hidden layer and finite neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$.
Theoretical Infinity
While a single layer *can* represent any function, deep networks are more *efficient* (parameter-wise) and have superior structural priors for natural data (e.g., images/text).
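As a rough illustration of the theorem (not Cybenko's construction): a single hidden layer of randomly placed sigmoid units, with only the output weights fitted by least squares, already approximates a smooth target on a compact interval. The width, weight scale, and target function below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 400)[:, None]   # compact interval [-3, 3]
target = np.cos(2 * x).ravel()         # continuous function to approximate

hidden = 200                           # finite number of hidden units
W = rng.standard_normal((1, hidden)) * 3   # random input weights
b = rng.standard_normal(hidden) * 3        # random biases
phi = 1.0 / (1.0 + np.exp(-(x @ W + b)))   # sigmoid hidden activations

# Fit only the linear readout; the hidden layer stays random.
coef, *_ = np.linalg.lstsq(phi, target, rcond=None)
error = np.max(np.abs(phi @ coef - target))
```

Even this crude random-features scheme drives the worst-case error down; the theorem says enough well-chosen units can drive it arbitrarily low.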
IV. Weight Initialization Theory
Proper initialization is not magic; it is about preserving the variance of activations across layers. If variance drops (vanishes), the signal dies. If it grows (explodes), activations saturate. We aim for $\text{Var}(y) \approx \text{Var}(x)$.
Assuming linear activation and zero-mean, independent weights and inputs, the variance of a neuron's output $y = \sum_{i=1}^{n_{in}} w_i x_i$ is: $$ \text{Var}(y) = n_{in} \, \text{Var}(w) \, \text{Var}(x) $$
Variance Propagation
Signal variance evolves multiplicatively through a 10-layer network; the goal is a layer-to-layer ratio near 1.0.
To keep the variance constant across layers ($\text{Var}(y) = \text{Var}(x)$), we need $n_{in} \, \text{Var}(w) = 1$, i.e. $\text{Var}(w) = \frac{1}{n_{in}}$.
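A quick numpy check of this rule, under the same linear-activation assumption as the derivation: drawing weights with $\text{Var}(w) = 1/n_{in}$ keeps the signal variance roughly constant across layers. The layer width and depth here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_layers = 512, 10
x = rng.standard_normal((2000, n_in))   # input with Var(x) ≈ 1

h = x
for _ in range(n_layers):
    # Var(w) = 1/n_in: scale standard-normal weights by 1/sqrt(n_in)
    W = rng.standard_normal((n_in, n_in)) / np.sqrt(n_in)
    h = h @ W                           # linear layer (no activation)

# Var(h) stays near Var(x) after 10 layers instead of vanishing/exploding
```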
Xavier (Glorot)
Optimized for Sigmoid/Tanh (symmetric). Considers both fan-in and fan-out.
$$ \text{Var}(W) = \frac{2}{n_{in} + n_{out}} $$

V. Deriving Backpropagation
**Backpropagation** is the specific application of the chain rule to the parameters of a neural network. Formalized by Rumelhart, Hinton, and Williams (1986), it allows us to compute the "blame" each weight bears for the final error in a single backward pass.
By flowing the gradient from the output layer back toward the input, we can efficiently update billions of parameters. However, this process is susceptible to the Vanishing Gradient Problem.
Consider the derivative of the Sigmoid function \( \sigma(z) \): $$ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $$ Its maximum value is 0.25 (at \( z=0 \)). In a deep network, gradients are computed via the chain rule, multiplying one such derivative per layer: $$ \frac{\partial L}{\partial W_1} \propto \sigma'(z_1) \cdot \sigma'(z_2) \cdots \sigma'(z_L) $$ With 10 layers, the gradient is scaled by at most \( 0.25^{10} \approx 9.5 \times 10^{-7} \). The signal effectively vanishes, and early layers stop learning. This motivates ReLU (whose gradient is exactly 1 for positive inputs) and Residual Connections (a gradient highway).
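The arithmetic above can be verified directly; this sketch assumes the best case, where every pre-activation sits at \( z = 0 \):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Best case for sigmoid: z = 0, where sigma'(z) peaks at exactly 0.25.
layers = 10
gradient_scale = sigmoid_prime(0.0) ** layers   # 0.25**10 ≈ 9.5e-7
```

Any pre-activation away from zero makes the factor smaller than 0.25, so real networks fare even worse than this bound.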
Gradient Flow Analyzer
As layers are added, the gradient reaching the input layer becomes vanishingly small.
VI. Second-Order Optimization
Standard SGD uses only the Gradient (first derivative, slope). Second-order methods also use the Hessian Matrix (second derivative, curvature): a Newton step jumps directly to the minimum of the local quadratic approximation, rather than following the slope blindly.
The Saddle Point Problem: In high dimensions, local minima are rare. Most critical points (\(\nabla J = 0\)) are **Saddle Points** (minima in some directions, maxima in others). Looking at the Hessian's eigenvalues allows us to distinguish these.
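A minimal sketch of the eigenvalue test, using the textbook saddle \( J(x, y) = x^2 - y^2 \) (the function choice is ours, for illustration):

```python
import numpy as np

# Hessian of J(x, y) = x^2 - y^2 at the critical point (0, 0):
# d2J/dx2 = 2, d2J/dy2 = -2, mixed partials are 0.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eigenvalues = np.linalg.eigvalsh(H)   # ascending order for symmetric H

if np.all(eigenvalues > 0):
    kind = "local minimum"            # curvature up in every direction
elif np.all(eigenvalues < 0):
    kind = "local maximum"            # curvature down in every direction
else:
    kind = "saddle point"             # mixed signs: min in some directions, max in others
```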
VII. Normalization Taxonomy
Internal Covariate Shift makes training deep networks unstable. We normalize activations \(x\) using: $$ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} $$
BatchNorm (BN)
Normalizes across the Batch dimension. Fast, but degrades with small batches (e.g., in object detection).
LayerNorm (LN)
Normalizes across the Feature dimension. Independent of batch size. Standard for Transformers/RNNs.
GroupNorm (GN)
Splits features into groups. A compromise between BN and LN, commonly used in CNNs.
InstanceNorm (IN)
Normalizes per channel per image. Used in Style Transfer to remove content contrast info.
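The four schemes differ only in which axes the statistics \( \mu \) and \( \sigma^2 \) are computed over. A numpy sketch on a toy \( (N, C, H, W) \) tensor (the learnable scale/shift parameters are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy activations: (batch N=8, channels C=6, height H=4, width W=4)
x = rng.standard_normal((8, 6, 4, 4)) * 3 + 5
eps = 1e-5

def normalize(t, axes):
    mu = t.mean(axis=axes, keepdims=True)
    var = t.var(axis=axes, keepdims=True)
    return (t - mu) / np.sqrt(var + eps)

bn = normalize(x, (0, 2, 3))   # BatchNorm: stats per channel, over N, H, W
ln = normalize(x, (1, 2, 3))   # LayerNorm: stats per sample, over C, H, W
inorm = normalize(x, (2, 3))   # InstanceNorm: stats per sample, per channel

# GroupNorm: split C into groups, normalize within each group per sample
g = 2
gn = normalize(x.reshape(8, g, 6 // g, 4, 4), (2, 3, 4)).reshape(x.shape)
```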
VIII. The Residual Pathology
Balduzzi et al. (2017) showed that in standard deep networks, gradients behave like white noise ("Shattered Gradients"): the correlation between gradients at different layers decays exponentially with depth.
ResNets ($y = x + F(x)$) act as an Identity Highway. During backprop, the gradient distributes across both paths: $$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(1 + \frac{\partial F}{\partial x}\right) $$
This "$+1$" term ensures that gradient flow never fully dies, allowing training of 1000+ layer networks.
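The "$+1$" term can be checked numerically on a scalar residual block (the branch $F$ below is an arbitrary stand-in):

```python
import numpy as np

def F(x):
    return np.tanh(0.01 * x)   # a weak residual branch: F'(x) is small

def residual(x):
    return x + F(x)            # identity path plus residual branch

# Central-difference derivative of the whole block vs. the branch alone
x0, h = 1.5, 1e-6
dy_dx = (residual(x0 + h) - residual(x0 - h)) / (2 * h)
dF_dx = (F(x0 + h) - F(x0 - h)) / (2 * h)

# dy_dx equals 1 + dF_dx: even when F'(x) shrinks toward zero,
# the identity path guarantees the gradient never falls below ~1.
```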
IX. Generalization & Regularization
References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning (Ch. 6: Deep Feedforward Networks). MIT Press.
- Stanford CS231n: Convolutional Neural Networks for Visual Recognition (Karpathy et al.).
- Nielsen, M. (2015). Neural Networks and Deep Learning.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature.
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems.
- LeCun, Y., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation.