Linear Algebra for Deep Learning
Master the essential mathematical foundations of deep learning through this interactive, step-by-step tutorial.
Why Linear Algebra for Deep Learning?
Linear algebra is a branch of mathematics that is widely used throughout science and engineering. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms.
Neural Networks
Matrix operations form the backbone of neural network computations.
Optimization
Gradient descent and other optimization algorithms rely heavily on linear algebra.
Dimensionality Reduction
Techniques like PCA use eigendecomposition and SVD.
Data Representation
Tensors and matrices represent data in machine learning models.
Scalars
A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers.
Notation
We write scalars in italics with lowercase variable names. For example: $s \in \mathbb{R}$ (a real-valued scalar) or $n \in \mathbb{N}$ (a natural number scalar).
Examples
- Temperature: $T = 25.5°C$
- Learning rate: $\alpha = 0.01$
- Number of epochs: $n = 100$
Python Example:
# Scalars in Python
import numpy as np
temperature = 25.5
learning_rate = 0.01
num_epochs = 100
print(f"Temperature: {temperature}")
print(f"Learning rate: {learning_rate}")
print(f"Number of epochs: {num_epochs}")
# Scalars are 0-dimensional arrays in NumPy
scalar_array = np.array(42)
print(f"Scalar as NumPy array: {scalar_array}")
print(f"Shape: {scalar_array.shape}") # Output: ()
print(f"Dimensions: {scalar_array.ndim}") # Output: 0
Vectors
A vector is an array of numbers. The numbers are arranged in order, and we can identify each individual number by its index in that ordering.
Notation
We give vectors lowercase names in bold typeface, such as $\mathbf{x}$. Elements are identified with subscripts: $x_1, x_2, \ldots, x_n$.
Mathematical Representation
A vector $\mathbf{x}$ with $n$ elements can be written as:
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
Geometric Interpretation
We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.
Vector Indexing
To access specific elements:
- $x_1, x_3, x_6$: Individual elements
- $\mathbf{x}_S$ where $S = \{1,3,6\}$: Subset of elements
- $\mathbf{x}_{-1}$: All elements except $x_1$
Vector Spaces
If each element is in $\mathbb{R}$ and the vector has $n$ elements, then the vector lies in $\mathbb{R}^n$.
Python Example:
import numpy as np
# Creating vectors
x = np.array([1, 2, 3, 4, 5])
print(f"Vector x: {x}")
print(f"Shape: {x.shape}") # Output: (5,)
print(f"Dimensions: {x.ndim}") # Output: 1
# Accessing elements
print(f"First element x[0]: {x[0]}")
print(f"Last element x[-1]: {x[-1]}")
# Subset indexing
indices = [0, 2, 4]  # zero-based positions, i.e. the 1st, 3rd, and 5th elements (x_1, x_3, x_5 in math notation)
subset = x[indices]
print(f"Subset x[{indices}]: {subset}")
# Vector as column (2D array with shape (n, 1))
x_column = x.reshape(-1, 1)
print(f"Column vector:\n{x_column}")
print(f"Column shape: {x_column.shape}") # Output: (5, 1)
# 3D point in space
point_3d = np.array([1.5, -2.0, 3.7])
print(f"3D point: {point_3d}")
print(f"x-coordinate: {point_3d[0]}")
print(f"y-coordinate: {point_3d[1]}")
print(f"z-coordinate: {point_3d[2]}")
Matrices
A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one.
Notation
We give matrices uppercase variable names with bold typeface, such as $\mathbf{A}$. Elements are identified as $A_{i,j}$ where $i$ is the row and $j$ is the column.
Mathematical Representation
A matrix $\mathbf{A}$ with $m$ rows and $n$ columns ($\mathbf{A} \in \mathbb{R}^{m \times n}$):
$$\mathbf{A} = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m,1} & A_{m,2} & \cdots & A_{m,n} \end{bmatrix}$$
Matrix Indexing
- $A_{i,j}$: Element at row $i$, column $j$
- $A_{i,:}$: Entire row $i$
- $A_{:,j}$: Entire column $j$
- $A_{1,1}$: Upper left entry
- $A_{m,n}$: Bottom right entry
Matrix Dimensions
If $\mathbf{A}$ has height $m$ and width $n$, we say $\mathbf{A} \in \mathbb{R}^{m \times n}$.
Example: A $3 \times 2$ matrix has 3 rows and 2 columns.
Python Example:
import numpy as np
# Creating matrices
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(f"Matrix A:\n{A}")
print(f"Shape: {A.shape}") # Output: (3, 3)
print(f"Dimensions: {A.ndim}") # Output: 2
# Accessing elements
print(f"Element A[0,0]: {A[0,0]}") # Upper left
print(f"Element A[2,2]: {A[2,2]}") # Bottom right
print(f"Element A[1,2]: {A[1,2]}") # Row 2, Column 3
# Accessing rows and columns
print(f"First row A[0,:]: {A[0,:]}")
print(f"Second column A[:,1]: {A[:,1]}")
# Matrix properties
m, n = A.shape
print(f"Number of rows (m): {m}")
print(f"Number of columns (n): {n}")
# Different sized matrix
B = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(f"Matrix B (3x2):\n{B}")
print(f"B shape: {B.shape}")
# Random matrix
C = np.random.randn(2, 4) # 2x4 matrix with random values
print(f"Random matrix C:\n{C}")
print(f"C shape: {C.shape}")
Tensors
In some cases we need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor.
Notation
We denote a tensor named "A" with this typeface: $\mathsf{A}$. We identify the element at coordinates $(i,j,k)$ by writing $A_{i,j,k}$.
Tensor Hierarchy
- 0D Tensor: Scalar (single number)
- 1D Tensor: Vector (array of numbers)
- 2D Tensor: Matrix (2D array)
- 3D+ Tensor: Higher-dimensional arrays
Common Applications
- Images: 3D tensor (height × width × channels)
- Video: 4D tensor (time × height × width × channels)
- Batch of images: 4D tensor (batch × height × width × channels)
Example: RGB Image
An RGB image can be represented as a 3D tensor with dimensions:
- Height: Number of pixels vertically
- Width: Number of pixels horizontally
- Channels: 3 (Red, Green, Blue)
Python Example:
import numpy as np
# 3D Tensor (e.g., RGB image: height x width x channels)
image_tensor = np.random.randint(0, 256, size=(224, 224, 3))
print(f"Image tensor shape: {image_tensor.shape}") # (224, 224, 3)
print(f"Dimensions: {image_tensor.ndim}") # 3
# Accessing elements in 3D tensor
pixel_value = image_tensor[100, 150, 0] # Red channel at pixel (100, 150)
print(f"Red value at pixel (100, 150): {pixel_value}")
# 4D Tensor (batch of images: batch_size x height x width x channels)
batch_size = 32
batch_tensor = np.random.randn(batch_size, 224, 224, 3)
print(f"Batch tensor shape: {batch_tensor.shape}") # (32, 224, 224, 3)
print(f"Dimensions: {batch_tensor.ndim}") # 4
# Accessing a single image from the batch
first_image = batch_tensor[0] # Shape: (224, 224, 3)
print(f"First image shape: {first_image.shape}")
# 5D Tensor (video batch: batch_size x time x height x width x channels)
video_tensor = np.random.randn(8, 30, 224, 224, 3) # 8 videos, 30 frames each
print(f"Video tensor shape: {video_tensor.shape}") # (8, 30, 224, 224, 3)
print(f"Dimensions: {video_tensor.ndim}") # 5
# Tensor operations
print(f"Total elements in image tensor: {image_tensor.size}")
print(f"Memory usage (bytes): {image_tensor.nbytes}")
# Reshaping tensors
flattened = image_tensor.reshape(-1) # Flatten to 1D
print(f"Flattened shape: {flattened.shape}") # (150528,) = 224*224*3
Transpose
The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner.
Notation
We denote the transpose of matrix $\mathbf{A}$ as $\mathbf{A}^T$, defined such that $(\mathbf{A}^T)_{i,j} = A_{j,i}$.
Visual Example
Original Matrix $\mathbf{A}$:
$$\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$
→
Transpose $\mathbf{A}^T$:
$$\mathbf{A}^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$
Properties of Transpose
- $(\mathbf{A}^T)^T = \mathbf{A}$
- $(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$
- $(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T$
Vector Transpose
Vectors can be thought of as matrices with only one column. The transpose of a vector is a matrix with only one row:
$\mathbf{x} = [x_1, x_2, x_3]^T$ (column vector)
$\mathbf{x}^T = [x_1, x_2, x_3]$ (row vector)
Scalar Transpose
A scalar is its own transpose:
$a^T = a$
Python Example:
import numpy as np
# Matrix transpose
A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(f"Original matrix A:\n{A}")
print(f"A shape: {A.shape}") # (2, 3)
# Transpose using .T attribute
A_T = A.T
print(f"Transpose A.T:\n{A_T}")
print(f"A.T shape: {A_T.shape}") # (3, 2)
# Transpose using transpose() method
A_transpose = A.transpose()
print(f"Using transpose() method:\n{A_transpose}")
# Vector transpose
x = np.array([1, 2, 3]) # 1D array
print(f"Vector x: {x}")
print(f"x shape: {x.shape}") # (3,)
# To get proper row/column vector behavior, use 2D arrays
x_col = x.reshape(-1, 1) # Column vector
x_row = x.reshape(1, -1) # Row vector
print(f"Column vector:\n{x_col}")
print(f"Column shape: {x_col.shape}") # (3, 1)
print(f"Row vector:\n{x_row}")
print(f"Row shape: {x_row.shape}") # (1, 3)
# Transpose of column vector gives row vector
x_col_T = x_col.T
print(f"Transpose of column vector:\n{x_col_T}")
print(f"Shape: {x_col_T.shape}") # (1, 3)
# Scalar transpose (scalars are unchanged)
scalar = np.array(42)
print(f"Scalar: {scalar}")
print(f"Scalar transpose: {scalar.T}") # Same value
Matrix Multiplication
One of the most important operations involving matrices is multiplication of two matrices. The matrix product of matrices $\mathbf{A}$ and $\mathbf{B}$ is a third matrix $\mathbf{C}$.
Dimension Compatibility
For the product to be defined, $\mathbf{A}$ must have the same number of columns as $\mathbf{B}$ has rows. If $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times p$, then $\mathbf{C}$ is $m \times p$.
Matrix Product Formula
The product operation is defined by:
$$C_{i,j} = \sum_k A_{i,k} B_{k,j}$$
Properties of Matrix Multiplication
- Distributive: $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{AB} + \mathbf{AC}$
- Associative: $\mathbf{A}(\mathbf{BC}) = (\mathbf{AB})\mathbf{C}$
- Not Commutative: $\mathbf{AB} \neq \mathbf{BA}$ (generally)
Special Products
- Dot Product: $\mathbf{x}^T\mathbf{y}$ (vectors)
- Element-wise (Hadamard): $\mathbf{A} \odot \mathbf{B}$
- Transpose Property: $(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T$
Visual Example
Matrix Multiplication Example
$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$$
For example, $C_{1,1} = 1 \times 5 + 2 \times 7 = 19$.
Python Example:
import numpy as np
# Matrix multiplication
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
# Matrix product using @ operator (recommended)
C = A @ B
print(f"A @ B:\n{C}")
# Alternative: using np.dot()
C_dot = np.dot(A, B)
print(f"np.dot(A, B):\n{C_dot}")
# Element-wise multiplication (Hadamard product)
C_hadamard = A * B
print(f"Element-wise A * B:\n{C_hadamard}")
# Dot product of vectors
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
dot_product = np.dot(x, y) # or x @ y
print(f"Dot product x·y: {dot_product}")
# Matrix-vector multiplication
v = np.array([1, 2])
result = A @ v
print(f"A @ v: {result}")
# Checking dimensions
print(f"A shape: {A.shape}")
print(f"B shape: {B.shape}")
print(f"C shape: {C.shape}")
# Transpose property: (AB)^T = B^T A^T
AB_T = (A @ B).T
BT_AT = B.T @ A.T
print(f"(AB)^T equals B^T A^T: {np.allclose(AB_T, BT_AT)}")
# Non-commutativity demonstration
AB = A @ B
BA = B @ A
print(f"AB equals BA: {np.allclose(AB, BA)}")
print(f"AB:\n{AB}")
print(f"BA:\n{BA}")
Norms
Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm.
Definition
Formally, the $L_p$ norm is given by: $\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $p \in \mathbb{R}, p \geq 1$.
Properties of a Norm
A norm is any function $f$ that satisfies:
- $f(\mathbf{x}) = 0 \Rightarrow \mathbf{x} = \mathbf{0}$
- $f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y})$ (triangle inequality)
- $\forall \alpha \in \mathbb{R}, f(\alpha\mathbf{x}) = |\alpha|f(\mathbf{x})$
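These properties are easy to check numerically. Below is a minimal sketch using the L2 norm via np.linalg.norm; the vectors and the scalar are arbitrary examples chosen for illustration.
import numpy as np
# Arbitrary example vectors and scalar (illustration only)
x = np.array([3.0, -4.0, 1.0])
y = np.array([1.0, 2.0, -2.0])
alpha = -2.5
f = np.linalg.norm  # L2 norm
# f(x) = 0 only for the zero vector
print(f"Zero vector has norm 0: {f(np.zeros(3)) == 0}")
# Triangle inequality: f(x + y) <= f(x) + f(y)
print(f"Triangle inequality holds: {f(x + y) <= f(x) + f(y)}")
# Absolute homogeneity: f(alpha * x) = |alpha| * f(x)
print(f"Homogeneity holds: {np.isclose(f(alpha * x), abs(alpha) * f(x))}")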
Common Norms
- L2 Norm (Euclidean): $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$
- L1 Norm: $\|\mathbf{x}\|_1 = \sum_i |x_i|$
- L∞ Norm (Max): $\|\mathbf{x}\|_\infty = \max_i |x_i|$
- Squared L2 Norm: $\|\mathbf{x}\|_2^2 = \mathbf{x}^T\mathbf{x}$
Applications
- L2: Distance, regularization
- L1: Sparsity, feature selection
- L∞: Uniform bounds
- Frobenius: Matrix norm
Frobenius Norm for Matrices
The Frobenius norm is analogous to the L2 norm for vectors:
$$\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$$
Python Example:
import numpy as np
# Vector for norm examples
x = np.array([3, 4, 5])
print(f"Vector x: {x}")
# L2 norm (Euclidean norm)
l2_norm = np.linalg.norm(x) # Default is L2
l2_norm_manual = np.sqrt(np.sum(x**2))
print(f"L2 norm: {l2_norm}")
print(f"L2 norm (manual): {l2_norm_manual}")
# L1 norm
l1_norm = np.linalg.norm(x, ord=1)
l1_norm_manual = np.sum(np.abs(x))
print(f"L1 norm: {l1_norm}")
print(f"L1 norm (manual): {l1_norm_manual}")
# L-infinity norm (max norm)
linf_norm = np.linalg.norm(x, ord=np.inf)
linf_norm_manual = np.max(np.abs(x))
print(f"L-infinity norm: {linf_norm}")
print(f"L-infinity norm (manual): {linf_norm_manual}")
# Squared L2 norm
squared_l2 = np.dot(x, x) # x^T x
print(f"Squared L2 norm: {squared_l2}")
# Unit vector (normalized)
x_unit = x / np.linalg.norm(x)
print(f"Unit vector: {x_unit}")
print(f"Unit vector norm: {np.linalg.norm(x_unit)}")
# Matrix norms
A = np.array([[1, 2],
              [3, 4]])
# Frobenius norm
frobenius_norm = np.linalg.norm(A, 'fro')
frobenius_manual = np.sqrt(np.sum(A**2))
print(f"Frobenius norm: {frobenius_norm}")
print(f"Frobenius norm (manual): {frobenius_manual}")
# Different p-norms
for p in [1, 2, 3, np.inf]:
    norm_p = np.linalg.norm(x, ord=p)
    print(f"L{p} norm: {norm_p}")
# Relationship between dot product and norms
y = np.array([1, 2, 3])
dot_product = np.dot(x, y)
norms_product = np.linalg.norm(x) * np.linalg.norm(y)
cos_theta = dot_product / norms_product
print(f"Cosine of angle between x and y: {cos_theta}")
# Distance between vectors
distance = np.linalg.norm(x - y)
print(f"Euclidean distance between x and y: {distance}")
Eigendecomposition
One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.
Definition
An eigenvector of a square matrix $\mathbf{A}$ is a nonzero vector $\mathbf{v}$ such that multiplication by $\mathbf{A}$ alters only the scale of $\mathbf{v}$: $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$
Key Concepts
Eigenvector ($\mathbf{v}$)
A direction that is preserved under the linear transformation represented by matrix $\mathbf{A}$.
Eigenvalue ($\lambda$)
The scaling factor by which the eigenvector is stretched or compressed.
Eigendecomposition Formula
If matrix $\mathbf{A}$ has $n$ linearly independent eigenvectors, we can write:
$$\mathbf{A} = \mathbf{V}\,\mathrm{diag}(\boldsymbol{\lambda})\,\mathbf{V}^{-1}$$
where $\mathbf{V}$ is the matrix whose columns are the eigenvectors and $\boldsymbol{\lambda}$ is the vector of eigenvalues.
Real Symmetric Matrices
Every real symmetric matrix can be decomposed as:
$$\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$$
where $\mathbf{Q}$ is an orthogonal matrix of eigenvectors and $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues.
Properties Revealed
- Matrix is singular ⟺ any eigenvalue is zero
- Positive definite ⟺ all eigenvalues > 0
- Positive semidefinite ⟺ all eigenvalues ≥ 0
Applications
- Principal Component Analysis (PCA)
- Stability analysis
- Quadratic form optimization
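These ideas can be verified directly in NumPy. The following is a minimal sketch using np.linalg.eig and np.linalg.eigh; the small symmetric matrix is an arbitrary illustration, not a prescribed example.
Python Example:
import numpy as np
# A small symmetric matrix (arbitrary illustration)
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
# General eigendecomposition: A = V diag(lambda) V^{-1}
eigenvalues, V = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors (columns of V):\n{V}")
# Defining property: A v = lambda v for each eigenpair
v0, lam0 = V[:, 0], eigenvalues[0]
print(f"A v = lambda v holds: {np.allclose(A @ v0, lam0 * v0)}")
# Reconstruct A from its eigendecomposition
A_rec = V @ np.diag(eigenvalues) @ np.linalg.inv(V)
print(f"V diag(lambda) V^-1 matches A: {np.allclose(A_rec, A)}")
# Real symmetric case: A = Q Lambda Q^T with orthogonal Q (np.linalg.eigh)
eigvals_sym, Q = np.linalg.eigh(A)
print(f"Q is orthogonal: {np.allclose(Q.T @ Q, np.eye(2))}")
print(f"Q Lambda Q^T matches A: {np.allclose(Q @ np.diag(eigvals_sym) @ Q.T, A)}")
# Eigenvalues reveal definiteness
print(f"Positive definite (all eigenvalues > 0): {np.all(eigvals_sym > 0)}")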
Singular Value Decomposition (SVD)
The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD is more generally applicable than eigendecomposition.
SVD Formula
Every real matrix $\mathbf{A}$ can be decomposed as: $\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^T$
Components of SVD
$\mathbf{U}$ (Left-singular vectors)
$m \times m$ orthogonal matrix. Columns are eigenvectors of $\mathbf{AA}^T$.
$\mathbf{D}$ (Singular values)
$m \times n$ diagonal matrix. Diagonal entries are square roots of eigenvalues of $\mathbf{A}^T\mathbf{A}$.
$\mathbf{V}$ (Right-singular vectors)
$n \times n$ orthogonal matrix. Columns are eigenvectors of $\mathbf{A}^T\mathbf{A}$.
Advantages of SVD
- Universal: Defined for every real matrix (not just square)
- Numerical stability: More robust than eigendecomposition
- Dimensionality reduction: Can truncate to lower rank approximations
- Pseudoinverse: Enables generalized matrix inversion
Applications
- Principal Component Analysis (PCA)
- Image compression
- Recommender systems
- Noise reduction
- Low-rank approximation
- Solving linear systems
- Data visualization
- Feature extraction
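As with eigendecomposition, the factorization is easy to explore in NumPy. The sketch below applies np.linalg.svd to an arbitrary 2×3 matrix to illustrate the decomposition, its connection to $\mathbf{A}^T\mathbf{A}$, a low-rank approximation, and the pseudoinverse.
Python Example:
import numpy as np
# A non-square matrix (arbitrary illustration) -- SVD applies to any real matrix
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
# Full SVD: A = U D V^T
U, singular_values, Vt = np.linalg.svd(A)
print(f"U shape: {U.shape}")      # (2, 2)
print(f"Singular values: {singular_values}")
print(f"V^T shape: {Vt.shape}")   # (3, 3)
# Build the m x n matrix D with the singular values on its diagonal
D = np.zeros(A.shape)
np.fill_diagonal(D, singular_values)
print(f"U D V^T matches A: {np.allclose(U @ D @ Vt, A)}")
# Singular values are square roots of the largest eigenvalues of A^T A
eigvals_desc = np.linalg.eigvalsh(A.T @ A)[::-1]
print(f"sqrt of top eigenvalues of A^T A: {np.sqrt(eigvals_desc[:2])}")
# Rank-1 (truncated) approximation of A
A_rank1 = singular_values[0] * np.outer(U[:, 0], Vt[0, :])
print(f"Rank-1 approximation:\n{A_rank1}")
# Pseudoinverse (NumPy computes it from the SVD internally)
print(f"Pseudoinverse of A:\n{np.linalg.pinv(A)}")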
Example: Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is a simple machine learning algorithm that can be derived using only knowledge of basic linear algebra. It demonstrates the practical application of eigendecomposition.
PCA Goal
Apply lossy compression to data points by representing them in a lower-dimensional space while minimizing information loss.
PCA Algorithm
1. Encoding Function
$f(\mathbf{x}) = \mathbf{D}^T\mathbf{x}$
Maps input to lower-dimensional code.
2. Decoding Function
$g(\mathbf{c}) = \mathbf{D}\mathbf{c}$
Reconstructs input from code.
Mathematical Derivation
To minimize reconstruction error, we solve:
$$\mathbf{c}^* = \arg\min_{\mathbf{c}} \|\mathbf{x} - g(\mathbf{c})\|_2^2$$
This leads to the optimal encoding:
$$\mathbf{c} = \mathbf{D}^T\mathbf{x}$$
Finding the Optimal Basis
The optimal decoding matrix $\mathbf{D}$ is found by solving:
$$\mathbf{D}^* = \arg\min_{\mathbf{D}} \sum_i \left\|\mathbf{x}^{(i)} - \mathbf{D}\mathbf{D}^T\mathbf{x}^{(i)}\right\|_2^2 \quad \text{subject to } \mathbf{D}^T\mathbf{D} = \mathbf{I}_l$$
The solution: $\mathbf{D}$ consists of the eigenvectors of $\mathbf{X}^T\mathbf{X}$ corresponding to the $l$ largest eigenvalues, where $\mathbf{X}$ is the data matrix whose rows are the data points $\mathbf{x}^{(i)\,T}$.
Key Insights
- Principal components are eigenvectors
- Eigenvalues indicate variance explained
- Orthogonal transformation preserves distances
Applications
- Dimensionality reduction
- Data visualization
- Feature extraction
- Noise reduction
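To make the derivation concrete, here is a minimal NumPy sketch of PCA exactly as set up above: center the data, take $\mathbf{D}$ to be the leading eigenvector of $\mathbf{X}^T\mathbf{X}$, encode with $\mathbf{D}^T\mathbf{x}$, and decode with $\mathbf{D}\mathbf{c}$. The synthetic data, the random seed, and the choice of keeping a single component are illustrative assumptions, not part of the derivation itself.
Python Example:
import numpy as np
# Synthetic 2-D data with one dominant direction (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0],
                                          [1.0, 0.5]])
X = X - X.mean(axis=0)  # PCA assumes centered data
# Principal directions: eigenvectors of X^T X, sorted by decreasing eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# Keep l = 1 component: D is the leading eigenvector (so D^T D = I)
D = eigenvectors[:, :1]
# Encode f(x) = D^T x and decode g(c) = D c (rows of X are data points)
codes = X @ D                  # shape (200, 1)
X_reconstructed = codes @ D.T  # shape (200, 2)
mse = np.mean(np.sum((X - X_reconstructed) ** 2, axis=1))
explained = eigenvalues[0] / eigenvalues.sum()
print(f"Mean squared reconstruction error: {mse:.4f}")
print(f"Variance explained by the first component: {explained:.3f}")
# Cross-check: the SVD of X yields the same principal direction (up to sign)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(f"Leading right-singular vector matches D: {np.allclose(np.abs(Vt[0]), np.abs(D[:, 0]))}")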