Alex's Notes

Goodfellow et al: Chapter 02: Linear Algebra

Metadata

  • Title: Linear Algebra

  • Number: 2

  • Book: Goodfellow: Deep Learning

Objects in Linear Algebra

The study of linear algebra involves the study of several mathematical objects:

  • Scalars: A single number, in the book written in italics, with lowercase variable names.

  • Vectors: An array of numbers, arranged in order, with individual numbers identifiable by index. Typically denoted in bold lowercase variables, like x, with the first element \(x_1\). If the elements are in \(\mathbb{R}\) and there are n of them, the vector is in the set \(\mathbb{R}^n\).

  • Matrices: A 2-D array of numbers, so each element is identified by two indices. Usually denoted with uppercase, bold variables, like A. If real valued with a height of m and width n, then \(\mathbf{A} \in \mathbb{R}^{m \times n}\). A ':' in an index position, as in \(\mathbf{A}_{i,:}\), denotes all the elements in that row or column.

  • Tensors: An array of numbers on a regular grid with a variable number of axes. For when two axes aren’t enough…

Operations

Transposition

A fundamental operation on matrices is the transpose. To transpose a matrix you take a mirror image across a diagonal line, the main diagonal from the top left to bottom right, such that:

\[(\mathbf{A}^T)_{i,j} = \mathbf{A}_{j,i}\]

Vectors can be thought of as matrices with one column, so the transpose of a vector is a matrix with one row. A scalar can be seen as a matrix with a single entry, so a scalar is its own transpose.
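A minimal sketch of these transpose facts in NumPy (my own illustration, not from the book):

```python
import numpy as np

# A 2x3 example matrix.
A = np.array([[1, 2, 3],
              [4, 5, 6]])

# (A^T)_{i,j} = A_{j,i}: shape (2, 3) becomes (3, 2).
A_T = A.T

# Transposing twice recovers the original matrix.
A_back = A_T.T
```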

Addition and Broadcasting

Matrices can be added so long as they have the same shape, by adding the corresponding elements such that \(C_{i,j} = A_{i,j} + B_{i,j}\).

A scalar can be added to a matrix (and a matrix can be multiplied by a scalar).

In deep learning we also allow vectors to be added to matrices through broadcasting, the implicit copying of the vector to many locations. The result is another matrix: \(C_{i,j} = A_{i,j} + b_j\). The vector \(\mathbf{b}\) is added to each row of the matrix.
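NumPy implements exactly this broadcasting convention, so the rule \(C_{i,j} = A_{i,j} + b_j\) can be checked directly (example values are my own):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([10, 20, 30])

# b is implicitly copied to each row of A: C[i, j] = A[i, j] + b[j].
C = A + b
```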

Multiplication

There are three multiplication operations on matrices:

  • The matrix product of A and B is a third matrix C. For this product to be defined A must have the same number of columns as B has rows. If A is \(m \times n\) and B is \(n \times p\) then C is \(m \times p\). This product operation is defined as:

    \[C_{i,j} = \sum_k A_{i,k}B_{k,j}\]

  • The Hadamard product or element-wise product is denoted as \(\mathbf{A} \odot \mathbf{B}\) and is a matrix containing the products of the corresponding individual elements.

  • The dot product between two vectors x and y of the same dimensionality is the matrix product \(\mathbf{x}^T \mathbf{y}\).

Matrix multiplication is distributive and associative, but not commutative. The dot product, however, is commutative.
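The three products above, sketched in NumPy (example values are mine):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

matmul = A @ B    # matrix product: C[i,j] = sum_k A[i,k] * B[k,j]
hadamard = A * B  # Hadamard (element-wise) product

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
dot = x @ y       # dot product x^T y

# Matrix multiplication is not commutative in general.
not_commutative = not np.array_equal(A @ B, B @ A)
```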

Trace

The trace operator gives the sum of all the diagonal entries of a matrix:

\[\mathrm{Tr}(\mathbf{A}) = \sum_i \mathbf{A}_{i,i}\]
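In code (a trivial check with a made-up matrix):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

# Sum of the main-diagonal entries: 1 + 4.
tr = np.trace(A)
```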

Determinant

The determinant of a square matrix - denoted det(A) - maps matrices to real scalars. It equals the product of all the eigenvalues of the matrix. It is a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0 then space loses all volume, contracting completely along at least one dimension. If the determinant is 1 then the transformation preserves volume.
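Both cases in NumPy (example matrices are my own; the singular one has linearly dependent rows):

```python
import numpy as np

# A diagonal matrix with eigenvalues 2 and 3: volume scales by 2 * 3 = 6.
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
det = np.linalg.det(A)

# Second row is twice the first: space collapses onto a line, det = 0.
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])
det_singular = np.linalg.det(singular)
```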

Identity and Inversion

The identity matrix is one that does not change any vector when multiplied by the matrix. All the entries on the main diagonal are 1, the other entries are 0. The matrix inverse is defined as \(\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n\) where I is the identity matrix.
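The defining property \(\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n\) can be checked numerically (example matrix is mine; equality holds only up to floating-point error):

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
A_inv = np.linalg.inv(A)

# A^{-1} A should reproduce the identity matrix (up to rounding).
product = A_inv @ A
```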

Linear Combination and Span

The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors. So determining whether \(\mathbf{Ax} = \mathbf{b}\) has a solution is determining whether b is in the span of the columns of A. This particular span is known as the column space or range of A.

For the system \(\mathbf{Ax} = \mathbf{b}\) to have a solution for all values of \(\mathbf{b} \in \mathbb{R}^m\), A must have at least m columns; otherwise the dimensionality of the column space would be less than m. Eg consider a 3x2 matrix: the target b is 3-D, but x is only 2-D, so the best we could do is trace out a 2-D plane within \(\mathbb{R}^3\). We could only find a solution if b lies on that plane.

But there could also be redundant columns (eg a 2x2 matrix with identical columns). This kind of redundancy is called linear dependence. If no vectors in a set are linear combinations of the other vectors then that set is linearly independent.

Norms

We often need to measure the size of a vector. In ML we do this using a function called a norm which is given by:

\[||\mathbf{x}||_p = \left( \sum_i |x_i|^p \right)^{\frac{1}{p}}\]

for \(p \in \mathbb{R}, p \geq 1\).

So if \(p=2\) we have the Euclidean norm, the Euclidean distance from the origin to the point identified by x. This is known as the \(L^2\) norm. It is used so frequently in ML that it is often denoted as \(||\mathbf{x}||\).

The squared \(L^2\) norm is often more convenient than the original. It can be calculated as \(\mathbf{x}^T\mathbf{x}\). Near the origin, though, it grows very slowly, which may be undesirable when the difference between zero and small nonzero elements is important; in these cases we often use the \(L^1\) norm instead.

Another norm that is often used is the max norm, \(L^\infty\), which simplifies to the absolute value of the element with the largest magnitude. We can measure the size of a matrix with the Frobenius norm, analogous to the \(L^2\) norm for vectors.
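All of these norms are available through `np.linalg.norm` (example vector is mine):

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)            # Euclidean norm: sqrt(9 + 16) = 5
l2_squared = x @ x                # squared L2 norm as x^T x
l1 = np.linalg.norm(x, 1)         # L1 norm: sum of absolute values
linf = np.linalg.norm(x, np.inf)  # max norm: largest magnitude

# Frobenius norm of a matrix (the default for 2-D input).
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
fro = np.linalg.norm(A)
```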

Special Matrices and Vectors

  • Diagonal matrices have nonzero entries only on the main diagonal. Multiplication by a diagonal matrix is computationally efficient. diag(v) denotes a square diagonal matrix with the elements of v along its main diagonal. Multiplication of a vector x by diag(v) is just \(\mathbf{v} \odot \mathbf{x}\). In many cases in ML we might derive some general algorithm in terms of arbitrary matrices but obtain a less expensive (but less descriptive) version by constraining some matrices to be diagonal.

  • A symmetric matrix is equal to its own transpose. Symmetric matrices are often generated by functions of two arguments where the order of the arguments doesn’t matter (eg the distance from a to b).

  • A unit vector is a vector with unit norm: \(||x||_2 = 1\)

  • Vectors are orthogonal to each other if \(\mathbf{x}^T\mathbf{y} = 0\). If they have a nonzero norm then they are at a 90 degree angle to each other. In \(\mathbb{R}^n\) at most n vectors may be mutually orthogonal with nonzero norm. If the vectors are orthogonal with unit norm they are orthonormal.

  • An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal: \(\mathbf{A}^T\mathbf{A} = \mathbf{A}\mathbf{A}^T = \mathbf{I}\) or \(\mathbf{A}^{-1} = \mathbf{A}^T\). They are interesting because their inverses are very cheap to compute.
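Two of these facts sketched in NumPy (a rotation matrix as the orthogonal example; all values are mine):

```python
import numpy as np

# Multiplying by diag(v) equals the element-wise product v ⊙ x.
v = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])
diag_result = np.diag(v) @ x
elementwise = v * x

# A 2-D rotation matrix is orthogonal: Q^T Q = I, so Q^{-1} = Q^T.
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
identity_check = Q.T @ Q
```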

Eigendecomposition

Often we want to break mathematical objects into constituent parts, or find some properties of them that are universal and not related to how we represent them. Eg integers can be decomposed into prime factors, and this can tell us some of their functional properties. Likewise with matrices, we can decompose them in ways that show us information about their properties not obvious from looking at them as arrays of elements.

One of the most common methods of decomposing matrices is eigendecomposition, where we deconstruct a matrix into a set of eigenvectors and eigenvalues:

  • An eigenvector of a square matrix \(\mathbf{A}\) is a nonzero vector \(\mathbf{v}\) such that multiplication by \(\mathbf{A}\) alters only the scale of \(\mathbf{v}\): \(\mathbf{A}\mathbf{v} = \lambda \mathbf{v}\). We are generally only interested in unit eigenvectors.

  • The scalar \(\lambda\) is known as the eigenvalue corresponding to this eigenvector.

  • The eigendecomposition of a matrix is the concatenation of all linearly independent eigenvectors and their corresponding eigenvalues. The eigenvectors are concatenated into a matrix, one eigenvector per column. The eigenvalues are concatenated into a vector, \(\mathbf{\lambda}\). This leaves the eigendecomposition as: \(\mathbf{A} = \mathbf{V}\mathrm{diag}(\mathbf{\lambda})\mathbf{V}^{-1}\).

Fortunately in DL we usually need to decompose only a specific class of matrices that have a simple decomposition. Every real symmetric matrix can be decomposed into an expression using only real-valued (not complex) eigenvalues and eigenvectors: \(\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T\), where \(\mathbf{Q}\) is an orthogonal matrix of eigenvectors and \(\mathbf{\Lambda}\) is a diagonal matrix of eigenvalues.
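For real symmetric matrices, `np.linalg.eigh` returns exactly this decomposition, and the matrix can be rebuilt from it (example matrix is mine):

```python
import numpy as np

# A real symmetric matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns real eigenvalues (ascending) and an orthogonal matrix Q
# of unit eigenvectors, so that A = Q diag(λ) Q^T.
eigvals, Q = np.linalg.eigh(A)

reconstructed = Q @ np.diag(eigvals) @ Q.T
```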

Singular Value Decomposition and Pseudoinversion

An alternative method for decomposing a matrix is the singular value decomposition, where we write the matrix as the product of three matrices: \(\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^T\). U and V are both orthogonal; D is diagonal (but not necessarily square).

Elements along the main diagonal of D are known as the singular values of A, and are the square roots of the eigenvalues of \(\mathbf{A}^T\mathbf{A}\).

Columns of U are known as the left-singular vectors and are the eigenvectors of \(\mathbf{A}\mathbf{A}^T\). Columns of V are the right-singular vectors and are the eigenvectors of \(\mathbf{A}^T\mathbf{A}\).

One major use of SVD is that it can be used to partially generalize matrix inversion to non-square matrices. See p. 43/44 of the book for this method.
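Both the decomposition and the SVD-based pseudoinverse are available in NumPy; `np.linalg.pinv` computes the Moore-Penrose pseudoinverse (example non-square matrix is mine):

```python
import numpy as np

# A non-square 3x2 matrix.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# A = U D V^T; s holds the singular values (descending order).
U, s, Vt = np.linalg.svd(A)

# The pseudoinverse generalizes inversion to non-square matrices.
A_pinv = np.linalg.pinv(A)

# Since A has full column rank, A^+ A recovers the 2x2 identity.
check = A_pinv @ A
```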

Principal Component Analysis

The chapter concludes by deriving principal component analysis from linear algebra and vector calculus. As this is covered later in the module it’s not noted here. See pp. 45ff in the book for the derivation.

Further Reading

Recommends the Matrix cookbook as a reference.