Deep dive into the framework powering modern AI research. Learn to think in tensors,
master the mechanics of automatic differentiation, and understand the dynamic computational graphs that
enable fast prototyping and production-grade model development.
I. The Imperative Mental Model
Transitioning from standard programming to deep learning requires a shift in how we view data
and operations. As described in Paszke et al. (2019), PyTorch was designed
to provide an imperative, "Python-first" experience.
Unlike static graph frameworks (like early TensorFlow), PyTorch constructs the computational
graph on the fly during the forward pass. This allows for standard Python control
flow (if-statements, loops) to be part of the model logic, making debugging and research
much more intuitive.
Architectural Axiom
PyTorch implements an eager execution model: every operation runs as soon as it is called. For tensors that require gradients, each operation also constructs a node in a Directed Acyclic Graph (DAG), which the dispatcher routes through the autograd layer so it can be replayed in the subsequent backward pass.
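As a minimal sketch of this imperative style (the toy GatedMLP module below is illustrative, not from the source), note how an ordinary Python if-statement participates directly in the forward pass:

import torch
import torch.nn as nn

class GatedMLP(nn.Module):                  # hypothetical toy model for illustration
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        if h.mean() > 0.5:                  # plain Python control flow, evaluated eagerly
            h = h * 2
        return self.fc2(h)

model = GatedMLP()
print(model(torch.randn(4, 8)))             # the graph for this call is built as it executes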
II. The Anatomy of a Tensor
The **Tensor** is the fundamental unit of data in PyTorch. While it resembles a NumPy array, a Tensor carries not just a data type, but also the device (CPU/CUDA) and the memory layout.
Storage vs. View: A tensor is physically a 1D array of typically contiguous
bytes in RAM/VRAM known as the Storage. The "Tensor" object is merely a
metadata view (Shape, Stride, Offset) over this storage.
Strides: The stride is a tuple giving the number of elements to skip in the underlying storage to advance the index by 1 along each dimension. For a 2-D tensor \( T \), the address of cell \( (i, j) \) is: $$ \text{addr} = \text{base\_addr} + i \cdot \text{stride}[0] + j \cdot \text{stride}[1] $$ Operations like transpose or permute are typically zero-cost because they only modify the stride tuple, not the underlying memory.
[Interactive demo: Stride Mechanics. Change the strides to see how the same 1D physical storage (RAM) is viewed as a 3×4 tensor; the underlying 1D storage never changes order.]
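The same mechanics in code, as a minimal sketch (the 3×4 shape mirrors the demo): transpose() returns a new view that shares the original storage, and only the stride metadata changes.

import torch

x = torch.arange(12).reshape(3, 4)          # storage holds 0..11 in order
print(x.stride())                           # (4, 1): skip 4 elements per row, 1 per column

y = x.t()                                   # transpose: a new view over the SAME storage
print(y.stride())                           # (1, 4): only the metadata changed
print(y.data_ptr() == x.data_ptr())         # True: no memory was copied
print(y.is_contiguous())                    # False: .contiguous() would materialize a copy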
III. The Semantics of Broadcasting
Broadcasting allows operations on tensors of different ranks and shapes: shapes are compared from the trailing dimensions backwards, and dimensions of size 1 (or missing dimensions) are logically expanded to match the target size. This follows the standard NumPy broadcasting convention.
[Interactive demo: Broadcasting Engine. Observe how a (1,4) vector is "broadcast" across a (4,4) matrix; the engine replicates the data logically without memory duplication.]
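A minimal sketch of the rule, using the same shapes as the demo; expand() makes the zero-copy replication explicit.

import torch

matrix = torch.ones(4, 4)
row = torch.arange(4.0).reshape(1, 4)       # shape (1, 4)

out = matrix + row                          # (4, 4) + (1, 4) broadcasts to (4, 4)
print(out.shape)                            # torch.Size([4, 4])

expanded = row.expand(4, 4)                 # logical replication, still zero-copy
print(expanded.stride())                    # (0, 1): stride 0 re-reads the same row
print(expanded.data_ptr() == row.data_ptr())  # True: no memory duplication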
IV. Automatic Differentiation
**Autograd** is the engine that enables neural networks to learn. It implements a technique known as Reverse-Mode Automatic Differentiation. As you perform operations on tensors that require gradients, PyTorch records them in a directed acyclic graph (DAG) of function nodes whose leaves are the input tensors and whose roots are the outputs.
By calling .backward(), the engine traverses this graph in reverse order, applying the chain rule to compute the gradient of the loss with respect to every leaf tensor (parameter). This frees researchers from manually deriving complex vector derivatives, a tedious and error-prone step that limited earlier neural network research.
[Interactive demo: Computational Graph. Build the dynamic graph for \( y = (x + w) \times 2 \) with example leaf values 2.0 and 3.0.]
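The same graph in code, as a minimal sketch (leaf values match the demo):

import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

y = (x + w) * 2                             # the graph is recorded during this forward pass
y.backward()                                # reverse traversal applies the chain rule

print(x.grad, w.grad)                       # tensor(2.) tensor(2.): dy/dx = dy/dw = 2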
Dynamic Graph vs. Static Graph
TensorFlow 1.x (Static) builds the graph once and executes it repeatedly. PyTorch (Dynamic)
rebuilds the graph every iteration.
This enables "Define-by-Run": loops and branches in Python code dynamically change the graph topology. The trade-off is higher per-operation interpreter overhead, which mechanisms like torch.compile (graph capture) aim to mitigate.
The Vector-Jacobian Product (VJP): PyTorch does not materialize the full Jacobian matrix \( J \) (which would be massive). Instead, for a node with output \( y = f(x) \) and incoming gradient \( v = \partial L / \partial y \), the backward pass computes the product of \( v \) with the Jacobian: $$ \frac{\partial L}{\partial x} = v \cdot J, \qquad J = \frac{\partial y}{\partial x} $$
This accumulation is efficient and naturally supports chain rule composition: \( v_{in} \cdot J_{total} = (v_{out} \cdot J_n) \cdot J_{n-1} \dots \).
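A minimal sketch: for a non-scalar output, backward() takes the vector \( v \) explicitly and returns the VJP instead of materializing the Jacobian.

import torch

x = torch.randn(3, requires_grad=True)
y = x ** 2                                  # non-scalar output; the full Jacobian would be 3x3

v = torch.tensor([1.0, 0.5, 0.25])          # incoming gradient vector
y.backward(v)                               # computes v . J without ever building J

print(torch.allclose(x.grad, v * 2 * x.detach()))  # True: the VJP is v * dy/dx elementwise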
Advanced: Custom Autograd Functions
Sometimes an operation has no usable analytical derivative (e.g., quantization), or you need to override the default gradient. You can define a custom torch.autograd.Function by implementing static forward and backward methods.
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)      # Save the input tensor for the backward phase
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0         # Mask the gradient for negative inputs
        return grad_input

# Use .apply(); never call forward() directly:
y = MyReLU.apply(torch.randn(5, requires_grad=True))
V. CUDA Kernels & Async Dispatch
CUDA (Compute Unified Device Architecture) execution is Asynchronous. When
you call a PyTorch function on GPU tensors, the CPU enqueues the kernel into a "Stream" and
immediately returns.
This hides the launch latency. Synchronization (blocking) only happens when the CPU actually needs a result: printing values, copying to the CPU (e.g., .item() or .cpu()), or calling torch.cuda.synchronize() explicitly.
Pinned Memory: For faster host-to-device transfer, use tensor.pin_memory(), which returns a copy in page-locked RAM. Page-locked memory lets the DMA (Direct Memory Access) engine copy data to the GPU without CPU intervention and makes .to(device, non_blocking=True) genuinely asynchronous.
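A minimal sketch of the pattern (assumes a CUDA device is available):

import torch

if torch.cuda.is_available():
    batch = torch.randn(1024, 1024).pin_memory()     # copy into page-locked host RAM
    gpu_batch = batch.to("cuda", non_blocking=True)  # async DMA copy; the CPU returns immediately

    result = (gpu_batch @ gpu_batch).sum()           # kernels are merely enqueued on the stream
    print(result.item())                             # .item() forces a device synchronization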
Kernel Fusion
Modern compilers (torch.compile) fuse multiple operations into a single Triton/CUDA kernel to minimize wasted memory bandwidth on memory-bound operation chains.
Mixed Precision (AMP)
Utilizing Tensor Cores (FP16/BF16) to achieve 2x-5x throughput gains over FP32 on
NVIDIA hardware.
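A minimal single-step training sketch with AMP (the toy linear model and random data are illustrative; assumes a CUDA device):

import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(128, 10).to(device)                # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()                 # rescales the loss to avoid FP16 underflow

inputs = torch.randn(64, 128, device=device)
targets = torch.randn(64, 10, device=device)

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.mse_loss(model(inputs), targets)   # matmuls hit Tensor Cores in FP16
scaler.scale(loss).backward()                        # backward on the scaled loss
scaler.step(optimizer)                               # unscales gradients, then steps the optimizer
scaler.update()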
VI. Distributed Training
Training on a single GPU is rarely sufficient for modern models. Distributed Data Parallel (DDP) is the industry standard for multi-GPU training.
[Interactive demo: Ring All-Reduce. Simulate gradient synchronization across 4 GPUs.]
The Mechanism (a minimal launch sketch follows this list):
Replicate: The model is replicated across all GPUs (ranks).
Scatter: The batch is split (scattered) so each GPU processes a
different slice of data.
Forward/Backward: Each GPU computes gradients independently.
All-Reduce: Gradients are averaged across all GPUs using a
Ring-AllReduce algorithm.
Update: The optimizer steps on each GPU with the identical average
gradient, keeping weights synchronized.
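A minimal launch sketch of this mechanism (the toy linear model and random data are illustrative; assumes launching with torchrun --nproc_per_node=4 train.py):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")              # one process per GPU (rank)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = f"cuda:{local_rank}"

model = DDP(torch.nn.Linear(32, 4).to(device),       # replicate + register gradient hooks
            device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    # each rank draws its own slice of data (a DistributedSampler does this for real datasets)
    inputs = torch.randn(16, 32, device=device)
    targets = torch.randn(16, 4, device=device)

    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()                                  # gradients are all-reduced bucket by bucket
    optimizer.step()                                 # identical averaged gradients on every rank

dist.destroy_process_group()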
Gradient Bucketing
To hide communication latency, DDP groups parameter gradients into "buckets"
(default 25MB). As soon
as a bucket is ready (computed), it is asynchronously All-Reduced while the GPU
continues computing
gradients for the next layers.
FSDP (Fully Sharded Data Parallel): For models that don't fit on one GPU
(e.g., Llama-70B), FSDP shards parameters, gradients, and optimizer states across
GPUs, gathering them only when needed for computation.
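As a rough sketch under the same process-group setup as the DDP example above (MyLargeModel is a placeholder module), the FSDP wrapper replaces the DDP wrapper:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Shard parameters, gradients, and optimizer state across ranks instead of replicating them.
model = FSDP(MyLargeModel().to(device))              # real use typically adds an auto_wrap_policy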
VII. The Compiler: JIT & TorchScript
Python's overhead (GIL, interpreter dispatch) can be significant for small operations.
TorchScript and torch.compile (PyTorch 2.0) aim to
decouple the model from the Python runtime.
Tracing
torch.jit.trace(model, dummy_input) runs the model once and records the operations it executes. Tracing is fast, but it cannot capture data-dependent control flow (if/else based on tensor values): whichever branch the dummy input takes is baked into the trace.
Scripting
torch.jit.script(model) analyzes the Python Abstract Syntax Tree (AST)
to compile the logic, preserving control flow.
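A minimal sketch of the difference (the toy Gate module is illustrative): tracing records only the branch taken for the dummy input, while scripting preserves both branches.

import torch
import torch.nn as nn

class Gate(nn.Module):                      # toy module with data-dependent control flow
    def forward(self, x):
        if x.sum() > 0:
            return x + 1
        return x - 1

dummy = torch.ones(3)
traced = torch.jit.trace(Gate(), dummy)     # TracerWarning: only the `x + 1` branch is recorded
scripted = torch.jit.script(Gate())         # compiles the AST, keeping both branches

x = -torch.ones(3)
print(traced(x))                            # tensor([0., 0., 0.])  -> wrong branch baked in
print(scripted(x))                          # tensor([-2., -2., -2.]) -> control flow preserved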
PyTorch 2.0 (Dynamo & Inductor): Use model = torch.compile(model). Dynamo captures the graph by hooking CPython's frame evaluation and analyzing the bytecode just before it runs. Inductor then generates tuned Triton kernels for the specific GPU architecture, often yielding 30-200% speedups.
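A minimal sketch (requires PyTorch 2.x; the toy model is illustrative, and actual speedups depend on the model and hardware):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 10))
compiled = torch.compile(model)             # Dynamo captures the graph; Inductor generates kernels

x = torch.randn(64, 256)
out = compiled(x)                           # first call triggers compilation; later calls reuse it
print(out.shape)                            # torch.Size([64, 10])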
Primary Sources & Further Reading
Documentation & Textbooks
Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019.
Stevens, E., Antiga, L., & Viehmann, T. (2020). Deep Learning with PyTorch. Manning Publications.