Notes on Multivariate Data Analysis via Matrix Decomposition
20190930
These are the notes taken on my master course Multivariate Data Analysis via Matrix Decomposition. If you’re confused with the course name, you can think of this as a statistical course on unsupervised learning.\(\newcommand{1}[1]{\unicode{x1D7D9}_{\{#1\}}}\newcommand{span}{\text{span}}\newcommand{bs}{\boldsymbol}\newcommand{R}{\mathbb{R}}\newcommand{rank}{\text{rank}}\newcommand{\norm}[1]{\left\lVert#1\right\rVert}\newcommand{diag}{\text{diag}}\newcommand{tr}{\text{tr}}\newcommand{braket}[1]{\left\langle#1\right\rangle}\newcommand{C}{\mathbb{C}}\)
Notations
First let’s give some standard notations used in this course. Let’s assume no prior knowledge in linear algebra and start from matrix multiplication.
Matrix Multiplication
We denote a matrix \(\bs{A}\in\R^{m\times n}\), with its entries defined as \([a_{ij}]_{i,j=1}^{m,n}\). Similarly, we define \(\bs{B}=[b_{jk}]_{j,k=1}^{n,p}\) and thus the multiplication is defined as \(\bs{AB} = [\sum_{j=1}^n a_{ij}b_{jk}]_{i,k=1}^{n,p}\), which can also be represented in three other ways:
 vector form, using \(\bs{a}\) and \(\bs{b}\)
 a matrix of products of \(A\) and \(\bs{b}\)
 a matrix of products of \(\bs{a}\) and \(\bs{B}\)
A special example of such representation: let’s assume
\[ \bs{A}=[\bs{a}_1,\bs{a}_2,\ldots,\bs{a}_n]\in\R^{m\times n}\text{ and } \bs{D} = \diag(d_1,d_2,\ldots,d_n) \in\R^{n\times n},\]
then we have right away \(\bs{AD}=[\bs{a}_id_i]_{i=1}^n\).
Exercise
With multiplication we care ranks of matrices. There is a quick conclusion: If \(\bs{x}\neq \bs{0}, \bs{y}\neq \bs{0}\), then \(\rank(\bs{xy'})=1\). Conversely, if \(\rank(\bs{A})=1\), then \(\exists\ \bs{x}\neq \bs{0}, \bs{y}\neq \bs{0}\) s.t. \(\bs{xy'}=\bs{A}\). Prove it.
Norms
There are two types of norms in this course we consider:
 (Euclidean) We define the \(l^1\)norm as \(\norm{x}_1 = \sum_{i=1}^n x_i\), define \(l^2\)norm as \(\norm{x}_2 = \sqrt{\bs{x'x}}\), define \(l^{\infty}\)norm as \(\norm{x}_{\infty} = \max_{1\le i \le n}\{x_i\}\), and define the Mahalanobis norm as \(\norm{x}_A = \sqrt{\bs{x'Ax}}\).
 (Frobenius) We define the Frobenius norm of a matrix as \(\norm{\bs{A}}_F=\sqrt{\sum_{i=1}^m\sum_{j=1}^n a_{ij}^2}\). The spectral 2norm of a matrix is defined as \(\norm{\bs{A}}_2=\max_{\bs{x}\neq \bs{0}} \norm{\bs{Ax}}_2 / \norm{\bs{x}}_2\).
Property
What makes these norms?
 \(\norm{\bs{v}}=0\) iff. \(\bs{v}=\bs{0}\).
 \(\norm{\alpha \bs{v}} = \alpha\cdot\norm{\bs{v}}\) for any \(\alpha\in\R\) and any \(\bs{v}\in\mathcal{V}\).
 (Triangular Inequality) \(\norm{\bs{u} + \bs{v}} \le \norm{\bs{u}} + \norm{\bs{v}}\) for any \(\bs{u}, \bs{v}\in\mathcal{V}\).
 (Submultiplicative) \(\norm{\bs{AB}}\le \norm{\bs{A}}\cdot \norm{\bs{B}}\) for every formable matrices \(\bs{A}\) and \(\bs{B}\).
Exercise
Try to prove them for Euclidean 2norm, Frobenius norm and spectral 2norm.
Inner Products
There are two types of inner products we consider:
 (Euclidean) We define the inner product of vectors \(\bs{x},\bs{y}\in\R^n\) as \(\bs{x'y}=\sum_{i=1}^n x_iy_i\).
 (Frobenius) We define the inner product of matrices \(\bs{A},\bs{B}\in\R^{m\times n}\) as \(\braket{\bs{A},\bs{B}}=\tr(\bs{A'B})=\sum_{i=1}^m\sum_{j=1}^n a_{ij}b_{ij}\).
A famous inequality related to these inner products is the CauchySchwarz inequality, which states
 (Euclidean) \(\bs{x'y}\le \norm{\bs{x}}_2\cdot\norm{\bs{y}}_2\) for any \(\bs{x,y}\in\R^n\).
 (Frobenius) \(\braket{\bs{A},\bs{B}}\le\norm{\bs{A}}_F\cdot\norm{\bs{B}}_F\) for any \(\bs{A},\bs{B}\in\R^{m\times n}\).
Eigenvalue Decomposition (EVD)
The first matrix decomposition we’re gonna talk about is the eigenvalue decomposition.
Eigenvalues and Eigenvectors
For square matrix \(\bs{A}\in\R^{n\times n}\), if \(\bs{0}\neq \bs{x}\in\C^n\) and \(\lambda\in\C\) is s.t. \(\bs{Ax} = \lambda\bs{x}\), then \(\lambda\) is called an engenvalue of \(\bs{A}\) and \(\bs{x}\) is called the \(\lambda\)engenvector of \(\bs{A}\).
Ideally, we want a matrix to have \(n\) eigenvectors and \(n\) corresponding eigenvectors, linearly independent to each other. This is not always true.
Existence of EVD
Theorem
\(\bs{A}\in\R^{n\times n}\) have \(n\) eigenvalues iff. there exists an invertible \(\bs{X}\in\R^{n\times n}\) s.t. \(\bs{X}^{1}\bs{A}\bs{X}=\bs{\Lambda}\), i.e. \(\bs{A}\) is diagonizable. This gives \(\bs{A}=\bs{X}\bs{\Lambda}\bs{X}^{1}\), which is called the eigenvalue decomposition (EVD).
Theorem
(Spectral Theorem for Symmetric Matrices) For symmetric matrix \(\bs{A}\in\R^{n\times n}\) there always exists an orthogonal matrix \(\bs{Q}\), namely \(\bs{Q}'\bs{Q}=\bs{I}\), that gives
\[ \bs{A}=\bs{Q\Lambda Q}' = \sum_{i=1}^n \lambda_i \bs{q}_i \bs{q}_i' \]
where \(\bs{q}\) are column vectors of \(\bs{Q}\). This is called the symmetric EVD, aka. \(\bs{A}\) being orthogonally diagonalizable.
Properties of EVD
Property
We have several properties following the second theorem above. For all \(i=1,2,\ldots, n\)
 \(\bs{A}\bs{q}_i = \lambda_i \bs{q}_i\) (can be proved using \(\bs{Q}^{1}=\bs{Q}'\))
 \(\norm{\bs{q}_i}_2=1\) (can be proved using \(\bs{QQ}'=\bs{I}\))
The second theorem above can also be represented as
Theorem
If \(\bs{A}=\bs{A}'\), then \(\bs{A}\) has \(n\) orthogonal eigenvectors.
Singular Value Decomposition (SVD)
For general matrices, we have singular value decomposition.
Definition
The most famous form of SVD is define as
\[ \bs{A} = \bs{U} \bs{\Sigma} \bs{V}' \]
where \(\bs{A}\in\R^{m\times n}\), \(\bs{U}\in\R^{m\times m}\), \(\bs{\Sigma}\in\R^{m\times n}\) and \(\bs{V}\in\R^{n\times n}\). Specifically, both \(\bs{U}\) and \(\bs{V}\) are orthogonal (i.e. \(\bs{U}'\bs{U}=\bs{I}\), same for \(\bs{V}\)) and \(\bs{\Sigma}\) is diagonal. Usually, we choose the singular values to be nondecreasing, namely
\[ \bs{\Sigma}=\diag(\sigma_1,\sigma_2,\ldots,\sigma_{\min\{m,n\}})\quad\text{where}\quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{m,n\}}. \]
Terminology
Here we define a list of terms that’ll be used from time to time:
 (SVD) \(\bs{A} = \bs{U} \bs{\Sigma} \bs{V}'\).
 (Left Singular Vectors) Columns of \(\bs{U}\).
 (Right Singular Vectors) Columns of \(\bs{V}\).
 (Singular Values) Diagonal entries of \(\bs{\Sigma}\).
Three Forms of SVD
Besides the regular SVD given above, we have the outer product SVD:
\[ \bs{A} = \sum_{i=1}^{\min\{m,n\}}\!\!\!\sigma_i \bs{u}_i \bs{v}_i' \]
and condensed SVD:
\[ \bs{A} = \bs{U}_r\bs{\Sigma}_r\bs{V}_r' \]
where \(r=\rank(\bs{A})\) is also the number of nonzero singular values. In this form, we have \(\bs{\Sigma}_r\in\R^{r\times r}\) with proper chunked \(\bs{U}_r\) and \(\bs{V}_r\).
Existence of SVD
Theorem
(Existence of SVD) Let \(\bs{A}\in\R^{m\times n}\) and \(r=\rank(\bs{A})\). Then \(\exists\ \bs{U}_r\in\R^{m\times r}\), \(\bs{V}_r\in\R^{n\times r}\) and \(\bs{\Sigma}_r\in\R^{r\times r}\) s.t. \(\bs{A} = \bs{U}_r\bs{\Sigma}_r\bs{V}_r'\) where \(\bs{U}_r\) and \(\bs{V}_r\) are orthogonal and \(\bs{\Sigma}_r\) is diagonal. This means condensed SVD exists and therefore the rest two forms.
Proof
Define symmetric \(\bs{W}\in\R^{(m+n)\times(m+n)}\) as
\[ \bs{W} = \begin{bmatrix} \bs{0} & \bs{A} \\ \bs{A}' & \bs{0} \end{bmatrix} \]
which has an orthogonal EVD as \(\bs{W} = \bs{Z}\bs{\Lambda}\bs{Z}'\) where \(\bs{Z}'\bs{Z}=\bs{I}\). Now, assume \(\bs{z}\in\R^{m+n}\) is an eigenvector of \(\bs{W}\) corresponding to \(\lambda\), then \(\bs{W}\bs{z} = \lambda \bs{z}\). Denote the first \(m\) entries of \(\bs{z}\) as \(\bs{x}\) and the rest \(\bs{y}\), which gives
\[ \begin{bmatrix} \bs{0} & \bs{A}\\ \bs{A}' & \bs{0} \end{bmatrix} \begin{bmatrix} \bs{x} \\ \bs{y} \end{bmatrix} = \lambda \begin{bmatrix} \bs{x} \\ \bs{y} \end{bmatrix} \Rightarrow \begin{cases} \bs{Ay} = \lambda \bs{x},\\ \bs{A}'\bs{x} = \lambda \bs{y}. \end{cases} \]
Using this results
\[ \begin{bmatrix} \bs{0} & \bs{A}\\ \bs{A}' & \bs{0} \end{bmatrix} \begin{bmatrix} \bs{x} \\ \bs{y} \end{bmatrix} = \begin{bmatrix} \bs{Ay} \\ \bs{A}'\bs{y} \end{bmatrix} = \begin{bmatrix} \lambda \bs{x}\\ \lambda \bs{y} \end{bmatrix} = \lambda\begin{bmatrix} \bs{x}\\ \bs{y} \end{bmatrix} \]
which means \(\lambda\) is also an engenvalue of \(\bs{W}\). Hence, we know
\[ \begin{align} \bs{W} &= \bs{Z}\bs{\Lambda}\bs{Z}' = \bs{Z}_r\bs{\Lambda}_r\bs{Z}_r'\\ &= \begin{bmatrix} \bs{X} & \bs{X}\\ \bs{Y} & \bs{Y} \end{bmatrix} \begin{bmatrix} \bs{\Sigma} & \bs{0}\\ \bs{0} & \bs{\Sigma} \end{bmatrix} \begin{bmatrix} \bs{X} & \bs{X}\\ \bs{Y} & \bs{Y} \end{bmatrix}'\\ &= \begin{bmatrix} \bs{0} & \bs{X}\bs{\Sigma}\bs{Y}'\\ \bs{Y}\bs{\Sigma}\bs{X}' & \bs{0} \end{bmatrix}. \end{align} \]
Therefore, we conclude \(\bs{A}=\bs{X}\bs{\Sigma}\bs{Y}'\) where now all we need to prove is the orthogonality of \(\bs{X}\) and \(\bs{Y}\). Let’s take a look at \(\bs{z}=(\bs{x},\bs{y})\) we just defined. Let
\[ \norm{\bs{z}}=\bs{z}'\bs{z}=\bs{x}'\bs{x} + \bs{y}'\bs{y} = 2. \]
From orthogonality of eigenvectors corresponding to different eigenvalues, we also know
\[ \bs{z}'\bar{\bs{z}} = \bs{x}'\bs{x}  \bs{y}'\bs{y} = 0 \]
which altogether gives \(\norm{\bs{x}}=\norm{\bs{y}}=1\).Q.E.D.
Properties of SVD
Property
There are several characteristics we have about SVD:
 The \(\bs{W}\) decomposition above.
 The left singular vector \(\bs{v}\) given by \(\bs{Au} = \sigma \bs{v}\), and the right singular vector \(\bs{u}\) given by \(\bs{A}'\bs{v} = \sigma \bs{u}\).
 Relationship with eigenvectors/eigenvalues…
 of \(\bs{A}'\bs{A}\): \(\bs{A}'\bs{A}\bs{u} = \sigma\bs{A}'\bs{v} = \sigma^2\bs{u}\).
 of \(\bs{AA}'\): \(\bs{AA}'\bs{v} = \sigma\bs{A}\bs{u} = \sigma^2\bs{v}\).
 Frobenius norms (eigenvalues cannot define a norm!):
 \(\norm{\bs{A}}_F^2 = \sum_{i,j=1}^{m,n}a_{ij}^2=\sum_{i=1}^r\sigma_i^2\).
 \(\norm{\bs{A}}_2 = \max_{\bs{x}\neq 0} \norm{\bs{Ax}}_2 / \norm{\bs{x}}_2 = \sigma_1\).
Exercise
Show how to use SVD to calculate these two norms.
Applications of SVD

One of the most importance usages of SVD is computing projections. \(\bs{P}\in\R^{n\times n}\) is a projection matrix iff. \(\bs{P}^2=\bs{P}\). More commonly, we consider orthogonal projection \(\bs{P}\) that’s also symmetric. Now let’s consider dataset \(\bs{A}\in\R^{m\times n}\) where we have \(n\) observations, each with \(m\) dimensions. Suppose we want to project this dataset onto \(\bs{W}\subseteq\R^m\) that has \(k\) dimensions, i.e.
\[ \bs{W} = \span\{\bs{q}_1,\bs{q}_2,\ldots,\bs{q}_k\},\quad \bs{q}_i'\bs{q}_j=\1{i=j} \]
then the projection matrix would be \(\bs{P}_{\bs{W}}=\bs{Q}_k\bs{Q}_k'\).
 The nearest orthogonal matrix of \(\bs{A}\in\R^{p\times p}\) is given by
\[ \min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}\bs{X}}_F \]
which solves if we have optima for
\[ \begin{align} \min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}\bs{X}}_F^2 &= \min_{\bs{X}'\bs{X}=\bs{I}}\tr[(\bs{A}\bs{X})'(\bs{A}\bs{X})]\\&= \min_{\bs{X}'\bs{X}=\bs{I}}\tr[\bs{A}'\bs{A}  \bs{X}'\bs{A}  \bs{A}'\bs{X} + \bs{X}'\bs{X}]\\&= \min_{\bs{X}'\bs{X}=\bs{I}} \norm{\bs{A}}_F^2  \tr(\bs{A}'\bs{X})  \tr(\bs{X}'\bs{A}) + \tr(\bs{X}'\bs{X})\\&= \norm{\bs{A}}_F^2 + n  2\max_{\bs{X}'\bs{X}=\bs{I}} \tr(\bs{A}'\bs{X}) \end{align}. \]
Now we try to solve
\[ \max_{\bs{X}'\bs{X}=\bs{I}} \tr(\bs{A}'\bs{X}) \]
and claim the solution is given by \(\bs{X} = \bs{U}\bs{V}'\) where \(\bs{U}\) and \(\bs{V}\) are derived from SVD of \(\bs{A}\), namely \(\bs{A} = \bs{U\Sigma V}'\). Proof: We know
\[ \tr(\bs{A}'\bs{X}) = \tr(\bs{V}\bs{\Sigma}'\bs{U}'X) = \tr(\bs{\Sigma}'\bs{U}'\bs{X}\bs{V}) =: \tr(\bs{\Sigma}'\bs{Z}) \]
where we define \(\bs{Z}\) as the product of the three orthogonal matrices, which therefore is orthogonal: \(\bs{Z}'\bs{Z}=\bs{I}\).
Orthogonality of \(\bs{Z}\) gives \(\forall i\)
\[ z_{i1}^2 + z_{i2}^2 + \cdot + z_{ip}^2 = 1 \Rightarrow z_{ii} \ge 1. \]
Hence, (note all singular values are nonnegative)
\[ \tr(\bs{\Sigma}'\bs{Z}) = \sum_{i=1}^p \sigma_i z_{ii} \le \sum_{i=1}^p \sigma_i \]
which gives optimal \(\bs{Z}^*=\bs{I}\) and thus the solution follows.
 The orthogonal Procrustes problem seeks the solution to
\[ \min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}\bs{BX}}_F \]
which is, similar to the problem above, given by the SVD of \(\bs{BA}'=\bs{U\Sigma V}'\), namely \(\bs{X}=\bs{UV}'\).

Nearest symmetric matrix for \(\bs{A}\in\R^{p\times p}\) seeks solution to \(\min\norm{\bs{A}\bs{X}}_F\), which is simply
\[ \bs{X} = \frac{\bs{A}+\bs{A}'}{2}. \]
In order to prove it, write \(\bs{A}\) in the form of
\[ \bs{A} = \frac{\bs{A} + \bs{A}'}{2} + \frac{\bs{A}  \bs{A}'}{2} =: \bs{X} + \bs{Y}. \]
Notice \(\tr(\bs{X}'\bs{Y})=0\), hence by Pythagoras we know \(\bs{Y}\) is the minimum we can find for the problem above.
 Best rank\(r\) approximation. In order to find the best rank\(r\) approximation in Frobenius norm, we need solution to
\[ \min_{\rank(\bs{X})\le r} \norm{\bs{A}\bs{X}}_F \]
which is merely \(\bs{X}=\bs{U}_r\bs{\Sigma}_r\bs{V}_r'\). See condensed SVD above for notation.
The best approximation in 2norm, namely solution to
\[ \min_{\rank(\bs{X})\le r} \norm{\bs{A}\bs{X}}_2, \]
is exactly identical to the one above. We may prove both by reduction to absurdity. Proof: Suppose \(\exists \bs{B}\in\R^{n\times p}\) s.t.
\[ \norm{\bs{A}\bs{B}}_2 < \norm{\bs{A}\bs{X}}_2 = \sigma_{r+1}. \]
Now choose \(\bs{w}\) from kernel of \(\bs{B}\) and we have
\[ \bs{Aw}=\bs{Aw}+\bs{0} = (\bs{A}\bs{B})\bs{w} \]
and thus
\[ \norm{\bs{Aw}}_2 = \norm{(\bs{A}\bs{B})\bs{w}}_2 \le \norm{\bs{A}\bs{B}}_2\cdot \norm{\bs{w}}_2 <\sigma_{r+1}\norm{\bs{w}}_2\tag{1}. \]
Meanwhile, note \(\bs{w}\in\span\{v_1,v_2,\ldots,v_{r+1}\}=\bs{W}\), assume particularly \(\bs{w}=\bs{v}_{r+1}\bs{\alpha}\), then
\[ \begin{align} \norm{\bs{Aw}}_2^2&=\norm{\bs{U}\bs{\Sigma}\bs{V}'\bs{w}}_2^2 = \sum_{i=1}^{r+1}\sigma_i^2\alpha_i^2 \ge \sigma_{r+1}^2\sum_{i=1}^{r+1}\alpha_i^2\\ &= \sigma_{r+1}^2\norm{\bs{\alpha}}_2^2\equiv \sigma_{r+1}^2\norm{\bs{w}}_2^2.\tag{2} \end{align} \]
Due to contradiction between eq. (1) and (2) we conclude such \(\bs{B}\) doesn’t exist.