Linear Algebra for Deep Learning
Master the essential mathematical foundations of deep learning through this interactive, step-by-step tutorial.
Why Linear Algebra for Deep Learning?
Linear algebra is a branch of mathematics that is widely used throughout science and engineering. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms.
Neural Networks
Matrix operations form the backbone of neural network computations.
Optimization
Gradient descent and other optimization algorithms rely heavily on linear algebra.
Dimensionality Reduction
Techniques like PCA use eigendecomposition and SVD.
Data Representation
Tensors and matrices represent data in machine learning models.
Scalars
A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers.
Notation
We write scalars in italics with lowercase variable names. For example: $s \in \mathbb{R}$ (a real-valued scalar) or $n \in \mathbb{N}$ (a natural number scalar).
Examples
- Temperature: $T = 25.5°C$
- Learning rate: $\alpha = 0.01$
- Number of epochs: $n = 100$
Python Example:
# Scalars in Python
import numpy as np
temperature = 25.5
learning_rate = 0.01
num_epochs = 100
print(f"Temperature: {temperature}")
print(f"Learning rate: {learning_rate}")
print(f"Number of epochs: {num_epochs}")
# Scalars are 0-dimensional arrays in NumPy
scalar_array = np.array(42)
print(f"Scalar as NumPy array: {scalar_array}")
print(f"Shape: {scalar_array.shape}") # Output: ()
print(f"Dimensions: {scalar_array.ndim}") # Output: 0
Vectors
A vector is an array of numbers. The numbers are arranged in order, and we can identify each individual number by its index in that ordering.
Notation
We give vectors lowercase names in bold typeface, such as $\mathbf{x}$. Elements are identified with subscripts: $x_1, x_2, \ldots, x_n$.
Mathematical Representation
A vector $\mathbf{x}$ with $n$ elements can be written as:
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
Geometric Interpretation
We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.
Vector Indexing
To access specific elements:
- $x_1, x_3, x_6$: Individual elements
- $\mathbf{x}_S$ where $S = \{1,3,6\}$: Subset of elements
- $\mathbf{x}_{-1}$: All elements except $x_1$
Vector Spaces
If each element is in $\mathbb{R}$ and the vector has $n$ elements, then the vector lies in $\mathbb{R}^n$.
Python Example:
import numpy as np
# Creating vectors
x = np.array([1, 2, 3, 4, 5])
print(f"Vector x: {x}")
print(f"Shape: {x.shape}") # Output: (5,)
print(f"Dimensions: {x.ndim}") # Output: 1
# Accessing elements
print(f"First element x[0]: {x[0]}")
print(f"Last element x[-1]: {x[-1]}")
# Subset indexing
indices = [0, 2, 4]  # zero-based positions, i.e. the 1st, 3rd, and 5th elements (x_1, x_3, x_5 in math notation)
subset = x[indices]
print(f"Subset x[{indices}]: {subset}")
# Vector as column (2D array with shape (n, 1))
x_column = x.reshape(-1, 1)
print(f"Column vector:\n{x_column}")
print(f"Column shape: {x_column.shape}") # Output: (5, 1)
# 3D point in space
point_3d = np.array([1.5, -2.0, 3.7])
print(f"3D point: {point_3d}")
print(f"x-coordinate: {point_3d[0]}")
print(f"y-coordinate: {point_3d[1]}")
print(f"z-coordinate: {point_3d[2]}")
Matrices
A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one.
Notation
We give matrices uppercase variable names with bold typeface, such as $\mathbf{A}$. Elements are identified as $A_{i,j}$ where $i$ is the row and $j$ is the column.
Mathematical Representation
A matrix $\mathbf{A}$ with $m$ rows and $n$ columns ($\mathbf{A} \in \mathbb{R}^{m \times n}$):
$$\mathbf{A} = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m,1} & A_{m,2} & \cdots & A_{m,n} \end{bmatrix}$$
Matrix Indexing
- $A_{i,j}$: Element at row $i$, column $j$
- $A_{i,:}$: Entire row $i$
- $A_{:,j}$: Entire column $j$
- $A_{1,1}$: Upper left entry
- $A_{m,n}$: Bottom right entry
Matrix Dimensions
If $\mathbf{A}$ has height $m$ and width $n$, we say $\mathbf{A} \in \mathbb{R}^{m \times n}$.
Example: A $3 \times 2$ matrix has 3 rows and 2 columns.
Python Example:
import numpy as np
# Creating matrices
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(f"Matrix A:\n{A}")
print(f"Shape: {A.shape}") # Output: (3, 3)
print(f"Dimensions: {A.ndim}") # Output: 2
# Accessing elements
print(f"Element A[0,0]: {A[0,0]}") # Upper left
print(f"Element A[2,2]: {A[2,2]}") # Bottom right
print(f"Element A[1,2]: {A[1,2]}") # Row 2, Column 3
# Accessing rows and columns
print(f"First row A[0,:]: {A[0,:]}")
print(f"Second column A[:,1]: {A[:,1]}")
# Matrix properties
m, n = A.shape
print(f"Number of rows (m): {m}")
print(f"Number of columns (n): {n}")
# Different sized matrix
B = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(f"Matrix B (3x2):\n{B}")
print(f"B shape: {B.shape}")
# Random matrix
C = np.random.randn(2, 4) # 2x4 matrix with random values
print(f"Random matrix C:\n{C}")
print(f"C shape: {C.shape}")
Tensors
In some cases we need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor.
Notation
We denote a tensor named "A" with this typeface: $\mathsf{A}$. We identify the element at coordinates $(i,j,k)$ by writing $A_{i,j,k}$.
Tensor Hierarchy
- 0D Tensor: Scalar (single number)
- 1D Tensor: Vector (array of numbers)
- 2D Tensor: Matrix (2D array)
- 3D+ Tensor: Higher-dimensional arrays
Common Applications
- Images: 3D tensor (height × width × channels)
- Video: 4D tensor (time × height × width × channels)
- Batch of images: 4D tensor (batch × height × width × channels)
Example: RGB Image
An RGB image can be represented as a 3D tensor with dimensions:
- Height: Number of pixels vertically
- Width: Number of pixels horizontally
- Channels: 3 (Red, Green, Blue)
Python Example:
import numpy as np
# 3D Tensor (e.g., RGB image: height x width x channels)
image_tensor = np.random.randint(0, 256, size=(224, 224, 3))
print(f"Image tensor shape: {image_tensor.shape}") # (224, 224, 3)
print(f"Dimensions: {image_tensor.ndim}") # 3
# Accessing elements in 3D tensor
pixel_value = image_tensor[100, 150, 0] # Red channel at pixel (100, 150)
print(f"Red value at pixel (100, 150): {pixel_value}")
# 4D Tensor (batch of images: batch_size x height x width x channels)
batch_size = 32
batch_tensor = np.random.randn(batch_size, 224, 224, 3)
print(f"Batch tensor shape: {batch_tensor.shape}") # (32, 224, 224, 3)
print(f"Dimensions: {batch_tensor.ndim}") # 4
# Accessing a single image from the batch
first_image = batch_tensor[0] # Shape: (224, 224, 3)
print(f"First image shape: {first_image.shape}")
# 5D Tensor (video batch: batch_size x time x height x width x channels)
video_tensor = np.random.randn(8, 30, 224, 224, 3) # 8 videos, 30 frames each
print(f"Video tensor shape: {video_tensor.shape}") # (8, 30, 224, 224, 3)
print(f"Dimensions: {video_tensor.ndim}") # 5
# Tensor operations
print(f"Total elements in image tensor: {image_tensor.size}")
print(f"Memory usage (bytes): {image_tensor.nbytes}")
# Reshaping tensors
flattened = image_tensor.reshape(-1) # Flatten to 1D
print(f"Flattened shape: {flattened.shape}") # (150528,) = 224*224*3
Transpose
The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner.
Notation
We denote the transpose of matrix $\mathbf{A}$ as $\mathbf{A}^T$, defined such that $(\mathbf{A}^T)_{i,j} = A_{j,i}$.
Visual Example
Original Matrix $\mathbf{A}$:
$$\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$
→
Transpose $\mathbf{A}^T$:
$$\mathbf{A}^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$
Properties of Transpose
- $(\mathbf{A}^T)^T = \mathbf{A}$
- $(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$
- $(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T$
Vector Transpose
Vectors can be thought of as matrices with only one column. The transpose of a vector is a matrix with only one row:
$\mathbf{x} = [x_1, x_2, x_3]^T$ (column vector)
$\mathbf{x}^T = [x_1, x_2, x_3]$ (row vector)
Scalar Transpose
A scalar is its own transpose:
$a^T = a$
Python Example:
import numpy as np
# Matrix transpose
A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(f"Original matrix A:\n{A}")
print(f"A shape: {A.shape}") # (2, 3)
# Transpose using .T attribute
A_T = A.T
print(f"Transpose A.T:\n{A_T}")
print(f"A.T shape: {A_T.shape}") # (3, 2)
# Transpose using transpose() method
A_transpose = A.transpose()
print(f"Using transpose() method:\n{A_transpose}")
# Vector transpose
x = np.array([1, 2, 3]) # 1D array
print(f"Vector x: {x}")
print(f"x shape: {x.shape}") # (3,)
# To get proper row/column vector behavior, use 2D arrays
x_col = x.reshape(-1, 1) # Column vector
x_row = x.reshape(1, -1) # Row vector
print(f"Column vector:\n{x_col}")
print(f"Column shape: {x_col.shape}") # (3, 1)
print(f"Row vector:\n{x_row}")
print(f"Row shape: {x_row.shape}") # (1, 3)
# Transpose of column vector gives row vector
x_col_T = x_col.T
print(f"Transpose of column vector:\n{x_col_T}")
print(f"Shape: {x_col_T.shape}") # (1, 3)
# Scalar transpose (scalars are unchanged)
scalar = np.array(42)
print(f"Scalar: {scalar}")
print(f"Scalar transpose: {scalar.T}") # Same value
Matrix Multiplication
One of the most important operations involving matrices is multiplication of two matrices. The matrix product of matrices $\mathbf{A}$ and $\mathbf{B}$ is a third matrix $\mathbf{C}$.
Dimension Compatibility
For the product to be defined, $\mathbf{A}$ must have the same number of columns as $\mathbf{B}$ has rows. If $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times p$, then $\mathbf{C}$ is $m \times p$.
Matrix Product Formula
The product operation is defined by:
$$C_{i,j} = \sum_k A_{i,k} B_{k,j}$$
Properties of Matrix Multiplication
- Distributive: $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{AB} + \mathbf{AC}$
- Associative: $\mathbf{A}(\mathbf{BC}) = (\mathbf{AB})\mathbf{C}$
- Not Commutative: $\mathbf{AB} \neq \mathbf{BA}$ (generally)
Special Products
- Dot Product: $\mathbf{x}^T\mathbf{y}$ (vectors)
- Element-wise (Hadamard): $\mathbf{A} \odot \mathbf{B}$
- Transpose Property: $(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T$
Visual Example
Matrix Multiplication Example
$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$$
For example, $C_{1,1} = 1 \times 5 + 2 \times 7 = 19$.
Python Example:
import numpy as np
# Matrix multiplication
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
# Matrix product using @ operator (recommended)
C = A @ B
print(f"A @ B:\n{C}")
# Alternative: using np.dot()
C_dot = np.dot(A, B)
print(f"np.dot(A, B):\n{C_dot}")
# Element-wise multiplication (Hadamard product)
C_hadamard = A * B
print(f"Element-wise A * B:\n{C_hadamard}")
# Dot product of vectors
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
dot_product = np.dot(x, y) # or x @ y
print(f"Dot product x·y: {dot_product}")
# Matrix-vector multiplication
v = np.array([1, 2])
result = A @ v
print(f"A @ v: {result}")
# Checking dimensions
print(f"A shape: {A.shape}")
print(f"B shape: {B.shape}")
print(f"C shape: {C.shape}")
# Transpose property: (AB)^T = B^T A^T
AB_T = (A @ B).T
BT_AT = B.T @ A.T
print(f"(AB)^T equals B^T A^T: {np.allclose(AB_T, BT_AT)}")
# Non-commutativity demonstration
AB = A @ B
BA = B @ A
print(f"AB equals BA: {np.allclose(AB, BA)}")
print(f"AB:\n{AB}")
print(f"BA:\n{BA}")
Norms
Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm.
Definition
Formally, the $L_p$ norm is given by: $\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $p \in \mathbb{R}, p \geq 1$.
Properties of a Norm
A norm is any function $f$ that satisfies:
- $f(\mathbf{x}) = 0 \Rightarrow \mathbf{x} = \mathbf{0}$
- $f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y})$ (triangle inequality)
- $\forall \alpha \in \mathbb{R}, f(\alpha\mathbf{x}) = |\alpha|f(\mathbf{x})$
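These properties are easy to check numerically. Below is a minimal sketch using the L2 norm via np.linalg.norm; the vectors and the scalar are arbitrary examples chosen for illustration.
import numpy as np
# Arbitrary example vectors and scalar (illustration only)
x = np.array([3.0, -4.0, 1.0])
y = np.array([1.0, 2.0, -2.0])
alpha = -2.5
f = np.linalg.norm  # L2 norm
# f(x) = 0 only for the zero vector
print(f"Zero vector has norm 0: {f(np.zeros(3)) == 0}")
# Triangle inequality: f(x + y) <= f(x) + f(y)
print(f"Triangle inequality holds: {f(x + y) <= f(x) + f(y)}")
# Absolute homogeneity: f(alpha * x) = |alpha| * f(x)
print(f"Homogeneity holds: {np.isclose(f(alpha * x), abs(alpha) * f(x))}")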
Common Norms
- L2 Norm (Euclidean): $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$
- L1 Norm: $\|\mathbf{x}\|_1 = \sum_i |x_i|$
- L∞ Norm (Max): $\|\mathbf{x}\|_\infty = \max_i |x_i|$
- Squared L2 Norm: $\|\mathbf{x}\|_2^2 = \mathbf{x}^T\mathbf{x}$
Applications
- L2: Distance, regularization
- L1: Sparsity, feature selection
- L∞: Uniform bounds
- Frobenius: Matrix norm
Frobenius Norm for Matrices
The Frobenius norm is analogous to the L2 norm for vectors:
$$\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$$
Python Example:
import numpy as np
# Vector for norm examples
x = np.array([3, 4, 5])
print(f"Vector x: {x}")
# L2 norm (Euclidean norm)
l2_norm = np.linalg.norm(x) # Default is L2
l2_norm_manual = np.sqrt(np.sum(x**2))
print(f"L2 norm: {l2_norm}")
print(f"L2 norm (manual): {l2_norm_manual}")
# L1 norm
l1_norm = np.linalg.norm(x, ord=1)
l1_norm_manual = np.sum(np.abs(x))
print(f"L1 norm: {l1_norm}")
print(f"L1 norm (manual): {l1_norm_manual}")
# L-infinity norm (max norm)
linf_norm = np.linalg.norm(x, ord=np.inf)
linf_norm_manual = np.max(np.abs(x))
print(f"L-infinity norm: {linf_norm}")
print(f"L-infinity norm (manual): {linf_norm_manual}")
# Squared L2 norm
squared_l2 = np.dot(x, x) # x^T x
print(f"Squared L2 norm: {squared_l2}")
# Unit vector (normalized)
x_unit = x / np.linalg.norm(x)
print(f"Unit vector: {x_unit}")
print(f"Unit vector norm: {np.linalg.norm(x_unit)}")
# Matrix norms
A = np.array([[1, 2],
              [3, 4]])
# Frobenius norm
frobenius_norm = np.linalg.norm(A, 'fro')
frobenius_manual = np.sqrt(np.sum(A**2))
print(f"Frobenius norm: {frobenius_norm}")
print(f"Frobenius norm (manual): {frobenius_manual}")
# Different p-norms
for p in [1, 2, 3, np.inf]:
    norm_p = np.linalg.norm(x, ord=p)
    print(f"L{p} norm: {norm_p}")
# Relationship between dot product and norms
y = np.array([1, 2, 3])
dot_product = np.dot(x, y)
norms_product = np.linalg.norm(x) * np.linalg.norm(y)
cos_theta = dot_product / norms_product
print(f"Cosine of angle between x and y: {cos_theta}")
# Distance between vectors
distance = np.linalg.norm(x - y)
print(f"Euclidean distance between x and y: {distance}")
Eigendecomposition
One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.
Definition
An eigenvector of a square matrix $\mathbf{A}$ is a nonzero vector $\mathbf{v}$ such that multiplication by $\mathbf{A}$ alters only the scale of $\mathbf{v}$: $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$
Key Concepts
Eigenvector ($\mathbf{v}$)
A direction that is preserved under the linear transformation represented by matrix $\mathbf{A}$.
Eigenvalue ($\lambda$)
The scaling factor by which the eigenvector is stretched or compressed.
Eigendecomposition Formula
If matrix $\mathbf{A}$ has $n$ linearly independent eigenvectors, we can write:
$$\mathbf{A} = \mathbf{V}\,\mathrm{diag}(\boldsymbol{\lambda})\,\mathbf{V}^{-1}$$
where $\mathbf{V}$ is the matrix whose columns are the eigenvectors and $\boldsymbol{\lambda}$ is the vector of eigenvalues.
Real Symmetric Matrices
Every real symmetric matrix can be decomposed as:
$$\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$$
where $\mathbf{Q}$ is an orthogonal matrix of eigenvectors and $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues.
Properties Revealed
- Matrix is singular ⟺ any eigenvalue is zero
- Positive definite ⟺ all eigenvalues > 0
- Positive semidefinite ⟺ all eigenvalues ≥ 0
Applications
- Principal Component Analysis (PCA)
- Stability analysis
- Quadratic form optimization
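These ideas can be verified directly in NumPy. The following is a minimal sketch using np.linalg.eig and np.linalg.eigh; the small symmetric matrix is an arbitrary illustration, not a prescribed example.
Python Example:
import numpy as np
# A small symmetric matrix (arbitrary illustration)
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
# General eigendecomposition: A = V diag(lambda) V^{-1}
eigenvalues, V = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors (columns of V):\n{V}")
# Defining property: A v = lambda v for each eigenpair
v0, lam0 = V[:, 0], eigenvalues[0]
print(f"A v = lambda v holds: {np.allclose(A @ v0, lam0 * v0)}")
# Reconstruct A from its eigendecomposition
A_rec = V @ np.diag(eigenvalues) @ np.linalg.inv(V)
print(f"V diag(lambda) V^-1 matches A: {np.allclose(A_rec, A)}")
# Real symmetric case: A = Q Lambda Q^T with orthogonal Q (np.linalg.eigh)
eigvals_sym, Q = np.linalg.eigh(A)
print(f"Q is orthogonal: {np.allclose(Q.T @ Q, np.eye(2))}")
print(f"Q Lambda Q^T matches A: {np.allclose(Q @ np.diag(eigvals_sym) @ Q.T, A)}")
# Eigenvalues reveal definiteness
print(f"Positive definite (all eigenvalues > 0): {np.all(eigvals_sym > 0)}")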
Singular Value Decomposition (SVD)
The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD is more generally applicable than eigendecomposition.
SVD Formula
Every real matrix $\mathbf{A}$ can be decomposed as: $\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^T$
Components of SVD
$\mathbf{U}$ (Left-singular vectors)
$m \times m$ orthogonal matrix. Columns are eigenvectors of $\mathbf{AA}^T$.
$\mathbf{D}$ (Singular values)
$m \times n$ diagonal matrix. Diagonal entries are square roots of eigenvalues of $\mathbf{A}^T\mathbf{A}$.
$\mathbf{V}$ (Right-singular vectors)
$n \times n$ orthogonal matrix. Columns are eigenvectors of $\mathbf{A}^T\mathbf{A}$.
Advantages of SVD
- Universal: Defined for every real matrix (not just square)
- Numerical stability: More robust than eigendecomposition
- Dimensionality reduction: Can truncate to lower rank approximations
- Pseudoinverse: Enables generalized matrix inversion
Applications
- Principal Component Analysis (PCA)
- Image compression
- Recommender systems
- Noise reduction
- Low-rank approximation
- Solving linear systems
- Data visualization
- Feature extraction
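As with eigendecomposition, the factorization is easy to explore in NumPy. The sketch below applies np.linalg.svd to an arbitrary 2×3 matrix to illustrate the decomposition, its connection to $\mathbf{A}^T\mathbf{A}$, a low-rank approximation, and the pseudoinverse.
Python Example:
import numpy as np
# A non-square matrix (arbitrary illustration) -- SVD applies to any real matrix
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
# Full SVD: A = U D V^T
U, singular_values, Vt = np.linalg.svd(A)
print(f"U shape: {U.shape}")      # (2, 2)
print(f"Singular values: {singular_values}")
print(f"V^T shape: {Vt.shape}")   # (3, 3)
# Build the m x n matrix D with the singular values on its diagonal
D = np.zeros(A.shape)
np.fill_diagonal(D, singular_values)
print(f"U D V^T matches A: {np.allclose(U @ D @ Vt, A)}")
# Singular values are square roots of the largest eigenvalues of A^T A
eigvals_desc = np.linalg.eigvalsh(A.T @ A)[::-1]
print(f"sqrt of top eigenvalues of A^T A: {np.sqrt(eigvals_desc[:2])}")
# Rank-1 (truncated) approximation of A
A_rank1 = singular_values[0] * np.outer(U[:, 0], Vt[0, :])
print(f"Rank-1 approximation:\n{A_rank1}")
# Pseudoinverse (NumPy computes it from the SVD internally)
print(f"Pseudoinverse of A:\n{np.linalg.pinv(A)}")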
Example: Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is a simple machine learning algorithm that can be derived using only knowledge of basic linear algebra. It demonstrates the practical application of eigendecomposition.
PCA Goal
Apply lossy compression to data points by representing them in a lower-dimensional space while minimizing information loss.
PCA Algorithm
1. Encoding Function
$f(\mathbf{x}) = \mathbf{D}^T\mathbf{x}$
Maps input to lower-dimensional code.
2. Decoding Function
$g(\mathbf{c}) = \mathbf{D}\mathbf{c}$
Reconstructs input from code.
Mathematical Derivation
To minimize reconstruction error, we solve:
$$\mathbf{c}^* = \arg\min_{\mathbf{c}} \|\mathbf{x} - g(\mathbf{c})\|_2^2$$
This leads to the optimal encoding:
$$\mathbf{c} = \mathbf{D}^T\mathbf{x}$$
Finding the Optimal Basis
The optimal decoding matrix $\mathbf{D}$ is found by solving:
$$\mathbf{D}^* = \arg\min_{\mathbf{D}} \sum_i \left\|\mathbf{x}^{(i)} - \mathbf{D}\mathbf{D}^T\mathbf{x}^{(i)}\right\|_2^2 \quad \text{subject to } \mathbf{D}^T\mathbf{D} = \mathbf{I}_l$$
The solution: $\mathbf{D}$ consists of the eigenvectors of $\mathbf{X}^T\mathbf{X}$ corresponding to the $l$ largest eigenvalues, where $\mathbf{X}$ is the data matrix whose rows are the data points $\mathbf{x}^{(i)\,T}$.
Key Insights
- Principal components are eigenvectors
- Eigenvalues indicate variance explained
- Orthogonal transformation preserves distances
Applications
- Dimensionality reduction
- Data visualization
- Feature extraction
- Noise reduction
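To make the derivation concrete, here is a minimal NumPy sketch of PCA exactly as set up above: center the data, take $\mathbf{D}$ to be the leading eigenvector of $\mathbf{X}^T\mathbf{X}$, encode with $\mathbf{D}^T\mathbf{x}$, and decode with $\mathbf{D}\mathbf{c}$. The synthetic data, the random seed, and the choice of keeping a single component are illustrative assumptions, not part of the derivation itself.
Python Example:
import numpy as np
# Synthetic 2-D data with one dominant direction (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0],
                                          [1.0, 0.5]])
X = X - X.mean(axis=0)  # PCA assumes centered data
# Principal directions: eigenvectors of X^T X, sorted by decreasing eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# Keep l = 1 component: D is the leading eigenvector (so D^T D = I)
D = eigenvectors[:, :1]
# Encode f(x) = D^T x and decode g(c) = D c (rows of X are data points)
codes = X @ D                  # shape (200, 1)
X_reconstructed = codes @ D.T  # shape (200, 2)
mse = np.mean(np.sum((X - X_reconstructed) ** 2, axis=1))
explained = eigenvalues[0] / eigenvalues.sum()
print(f"Mean squared reconstruction error: {mse:.4f}")
print(f"Variance explained by the first component: {explained:.3f}")
# Cross-check: the SVD of X yields the same principal direction (up to sign)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(f"Leading right-singular vector matches D: {np.allclose(np.abs(Vt[0]), np.abs(D[:, 0]))}")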