Deep dive into the framework powering modern AI research. Learn to think in tensors,
master the mechanics of automatic differentiation, and understand the dynamic computational graphs that
enable fast prototyping and production-grade model development.
I. The Imperative Mental Model
Transitioning from standard programming to deep learning requires a shift in how we view data
and operations. As described in Paszke et al. (2019), PyTorch was designed
to provide an imperative, "Python-first" experience.
Unlike static graph frameworks (like early TensorFlow), PyTorch constructs the computational
graph on the fly during the forward pass. This allows for standard Python control
flow (if-statements, loops) to be part of the model logic, making debugging and research
much more intuitive.
Architectural Axiom
PyTorch implements an eager execution model: every operation runs as soon as it is called. For tensors that require gradients, each operation also constructs a node in a Directed Acyclic Graph (DAG), which the dispatcher routes through the autograd layer so it can be replayed in the subsequent backward pass.
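As a minimal sketch of this imperative style (the toy GatedMLP module below is illustrative, not from the source), note how an ordinary Python if-statement participates directly in the forward pass:

import torch
import torch.nn as nn

class GatedMLP(nn.Module):                  # hypothetical toy model for illustration
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        if h.mean() > 0.5:                  # plain Python control flow, evaluated eagerly
            h = h * 2
        return self.fc2(h)

model = GatedMLP()
print(model(torch.randn(4, 8)))             # the graph for this call is built as it executes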
II. The Anatomy of a Tensor
The **Tensor** is the fundamental unit of data in PyTorch. While it resembles a NumPy array, a Tensor carries not just a data type, but also the device (CPU/CUDA) and the memory layout.
Storage vs. View: A tensor is physically a 1D array of typically contiguous
bytes in RAM/VRAM known as the Storage. The "Tensor" object is merely a
metadata view (Shape, Stride, Offset) over this storage.
Strides: The stride is a tuple giving the number of elements to skip in the underlying storage to advance the index by 1 along each dimension. For a 2-D tensor \( T \), the address of cell \( (i, j) \) is: $$ \text{addr} = \text{base\_addr} + i \cdot \text{stride}[0] + j \cdot \text{stride}[1] $$ Operations like transpose or permute are typically zero-cost because they only modify the stride tuple, not the underlying memory.
[Interactive demo: Stride Mechanics. Change the strides to see how the same 1D physical storage (RAM) is viewed as a 3×4 tensor; the underlying 1D storage never changes order.]
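The same mechanics in code, as a minimal sketch (the 3×4 shape mirrors the demo): transpose() returns a new view that shares the original storage, and only the stride metadata changes.

import torch

x = torch.arange(12).reshape(3, 4)          # storage holds 0..11 in order
print(x.stride())                           # (4, 1): skip 4 elements per row, 1 per column

y = x.t()                                   # transpose: a new view over the SAME storage
print(y.stride())                           # (1, 4): only the metadata changed
print(y.data_ptr() == x.data_ptr())         # True: no memory was copied
print(y.is_contiguous())                    # False: .contiguous() would materialize a copy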
III. The Semantics of Broadcasting
Broadcasting allows operations on tensors of different ranks and shapes: shapes are compared from the trailing dimensions backwards, and dimensions of size 1 (or missing dimensions) are logically expanded to match the target size. This follows the standard NumPy broadcasting convention.
[Interactive demo: Broadcasting Engine. Observe how a (1,4) vector is "broadcast" across a (4,4) matrix; the engine replicates the data logically without memory duplication.]
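A minimal sketch of the rule, using the same shapes as the demo; expand() makes the zero-copy replication explicit.

import torch

matrix = torch.ones(4, 4)
row = torch.arange(4.0).reshape(1, 4)       # shape (1, 4)

out = matrix + row                          # (4, 4) + (1, 4) broadcasts to (4, 4)
print(out.shape)                            # torch.Size([4, 4])

expanded = row.expand(4, 4)                 # logical replication, still zero-copy
print(expanded.stride())                    # (0, 1): stride 0 re-reads the same row
print(expanded.data_ptr() == row.data_ptr())  # True: no memory duplication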
IV. Automatic Differentiation
**Autograd** is the engine that enables neural networks to learn. It implements a technique known as Reverse-Mode Automatic Differentiation. As you perform operations on tensors that require gradients, PyTorch records them in a directed acyclic graph (DAG) of function nodes whose leaves are the input tensors and whose roots are the outputs.
By calling .backward(), the engine traverses this graph in reverse order, applying the chain rule to compute the gradient of the loss with respect to every leaf tensor (parameter). This frees researchers from manually deriving complex vector derivatives, a tedious and error-prone step that limited earlier neural network research.
[Interactive demo: Computational Graph. Build the dynamic graph for \( y = (x + w) \times 2 \) with example leaf values 2.0 and 3.0.]
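The same graph in code, as a minimal sketch (leaf values match the demo):

import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

y = (x + w) * 2                             # the graph is recorded during this forward pass
y.backward()                                # reverse traversal applies the chain rule

print(x.grad, w.grad)                       # tensor(2.) tensor(2.): dy/dx = dy/dw = 2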
Dynamic Graph vs. Static Graph
TensorFlow 1.x (Static) builds the graph once and executes it repeatedly. PyTorch (Dynamic)
rebuilds the graph every iteration.
This enables "Define-by-Run": loops and branches in Python code dynamically change the graph topology. The trade-off is higher per-operation interpreter overhead, which mechanisms like torch.compile (graph capture) aim to mitigate.
The Vector-Jacobian Product (VJP): PyTorch does not materialize the full Jacobian matrix \( J \) (which would be massive). Instead, for a node with output \( y = f(x) \) and incoming gradient \( v = \partial L / \partial y \), the backward pass computes the product of \( v \) with the Jacobian: $$ \frac{\partial L}{\partial x} = v \cdot J, \qquad J = \frac{\partial y}{\partial x} $$
This accumulation is efficient and naturally supports chain rule composition: \( v_{in} \cdot J_{total} = (v_{out} \cdot J_n) \cdot J_{n-1} \dots \).
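A minimal sketch: for a non-scalar output, backward() takes the vector \( v \) explicitly and returns the VJP instead of materializing the Jacobian.

import torch

x = torch.randn(3, requires_grad=True)
y = x ** 2                                  # non-scalar output; the full Jacobian would be 3x3

v = torch.tensor([1.0, 0.5, 0.25])          # incoming gradient vector
y.backward(v)                               # computes v . J without ever building J

print(torch.allclose(x.grad, v * 2 * x.detach()))  # True: the VJP is v * dy/dx elementwise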
Advanced: Custom Autograd Functions
Sometimes an operation has no usable analytical derivative (e.g., quantization), or you need to override the default gradient. You can define a custom torch.autograd.Function by implementing static forward and backward methods.
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)      # Save the input tensor for the backward phase
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0         # Mask the gradient for negative inputs
        return grad_input

# Use .apply(); never call forward() directly:
y = MyReLU.apply(torch.randn(5, requires_grad=True))
V. CUDA Kernels & Async Dispatch
CUDA (Compute Unified Device Architecture) execution is Asynchronous. When
you call a PyTorch function on GPU tensors, the CPU enqueues the kernel into a "Stream" and
immediately returns.
This hides the launch latency. Synchronization (blocking) only happens when the CPU actually needs a result: printing values, copying to the CPU (e.g., .item() or .cpu()), or calling torch.cuda.synchronize() explicitly.
Pinned Memory: For faster host-to-device transfer, use tensor.pin_memory(), which returns a copy in page-locked RAM. Page-locked memory lets the DMA (Direct Memory Access) engine copy data to the GPU without CPU intervention and makes .to(device, non_blocking=True) genuinely asynchronous.
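A minimal sketch of the pattern (assumes a CUDA device is available):

import torch

if torch.cuda.is_available():
    batch = torch.randn(1024, 1024).pin_memory()     # copy into page-locked host RAM
    gpu_batch = batch.to("cuda", non_blocking=True)  # async DMA copy; the CPU returns immediately

    result = (gpu_batch @ gpu_batch).sum()           # kernels are merely enqueued on the stream
    print(result.item())                             # .item() forces a device synchronization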
Kernel Fusion
Modern compilers (torch.compile) fuse multiple operations into a single Triton/CUDA kernel to minimize wasted memory bandwidth on memory-bound operation chains.
Mixed Precision (AMP)
Utilizing Tensor Cores (FP16/BF16) to achieve 2x-5x throughput gains over FP32 on
NVIDIA hardware.
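A minimal single-step training sketch with AMP (the toy linear model and random data are illustrative; assumes a CUDA device):

import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(128, 10).to(device)                # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()                 # rescales the loss to avoid FP16 underflow

inputs = torch.randn(64, 128, device=device)
targets = torch.randn(64, 10, device=device)

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.mse_loss(model(inputs), targets)   # matmuls hit Tensor Cores in FP16
scaler.scale(loss).backward()                        # backward on the scaled loss
scaler.step(optimizer)                               # unscales gradients, then steps the optimizer
scaler.update()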
VI. Distributed Training
Training on a single GPU is rarely sufficient for modern models. Distributed Data Parallel (DDP) is the industry standard for multi-GPU training.
[Interactive demo: Ring All-Reduce. Simulate gradient synchronization across 4 GPUs.]
The Mechanism (a minimal launch sketch follows this list):
Replicate: The model is replicated across all GPUs (ranks).
Scatter: The batch is split (scattered) so each GPU processes a
different slice of data.
Forward/Backward: Each GPU computes gradients independently.
All-Reduce: Gradients are averaged across all GPUs using a
Ring-AllReduce algorithm.
Update: The optimizer steps on each GPU with the identical average
gradient, keeping weights synchronized.
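A minimal launch sketch of this mechanism (the toy linear model and random data are illustrative; assumes launching with torchrun --nproc_per_node=4 train.py):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")              # one process per GPU (rank)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = f"cuda:{local_rank}"

model = DDP(torch.nn.Linear(32, 4).to(device),       # replicate + register gradient hooks
            device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    # each rank draws its own slice of data (a DistributedSampler does this for real datasets)
    inputs = torch.randn(16, 32, device=device)
    targets = torch.randn(16, 4, device=device)

    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()                                  # gradients are all-reduced bucket by bucket
    optimizer.step()                                 # identical averaged gradients on every rank

dist.destroy_process_group()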
Gradient Bucketing
To hide communication latency, DDP groups parameter gradients into "buckets"
(default 25MB). As soon
as a bucket is ready (computed), it is asynchronously All-Reduced while the GPU
continues computing
gradients for the next layers.
FSDP (Fully Sharded Data Parallel): For models that don't fit on one GPU
(e.g., Llama-70B), FSDP shards parameters, gradients, and optimizer states across
GPUs, gathering them only when needed for computation.
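As a rough sketch under the same process-group setup as the DDP example above (MyLargeModel is a placeholder module), the FSDP wrapper replaces the DDP wrapper:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Shard parameters, gradients, and optimizer state across ranks instead of replicating them.
model = FSDP(MyLargeModel().to(device))              # real use typically adds an auto_wrap_policy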
VII. The Compiler: JIT & TorchScript
Python's overhead (GIL, interpreter dispatch) can be significant for small operations.
TorchScript and torch.compile (PyTorch 2.0) aim to
decouple the model from the Python runtime.
Tracing
torch.jit.trace(model, dummy_input) runs the model once and records the operations it executes. Tracing is fast, but it cannot capture data-dependent control flow (if/else based on tensor values): whichever branch the dummy input takes is baked into the trace.
Scripting
torch.jit.script(model) analyzes the Python Abstract Syntax Tree (AST)
to compile the logic, preserving control flow.
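A minimal sketch of the difference (the toy Gate module is illustrative): tracing records only the branch taken for the dummy input, while scripting preserves both branches.

import torch
import torch.nn as nn

class Gate(nn.Module):                      # toy module with data-dependent control flow
    def forward(self, x):
        if x.sum() > 0:
            return x + 1
        return x - 1

dummy = torch.ones(3)
traced = torch.jit.trace(Gate(), dummy)     # TracerWarning: only the `x + 1` branch is recorded
scripted = torch.jit.script(Gate())         # compiles the AST, keeping both branches

x = -torch.ones(3)
print(traced(x))                            # tensor([0., 0., 0.])  -> wrong branch baked in
print(scripted(x))                          # tensor([-2., -2., -2.]) -> control flow preserved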
PyTorch 2.0 (Dynamo & Inductor): Use model = torch.compile(model). Dynamo captures the graph by hooking CPython's frame evaluation and analyzing the bytecode just before it runs. Inductor then generates tuned Triton kernels for the specific GPU architecture, often yielding 30-200% speedups.
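A minimal sketch (requires PyTorch 2.x; the toy model is illustrative, and actual speedups depend on the model and hardware):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 10))
compiled = torch.compile(model)             # Dynamo captures the graph; Inductor generates kernels

x = torch.randn(64, 256)
out = compiled(x)                           # first call triggers compilation; later calls reuse it
print(out.shape)                            # torch.Size([64, 10])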
Primary Sources & Further Reading
Documentation & Textbooks
Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019.
Stevens, E., Antiga, L., & Viehmann, T. (2020). Deep Learning with PyTorch. Manning Publications.