
Notes on Multivariate Data Analysis via Matrix Decomposition

2019-09-30


These are the notes I took for my master's course Multivariate Data Analysis via Matrix Decomposition. If the course name confuses you, you can think of this as a statistics course on unsupervised learning.\(\newcommand{1}[1]{\unicode{x1D7D9}_{\{#1\}}}\newcommand{span}{\text{span}}\newcommand{bs}{\boldsymbol}\newcommand{R}{\mathbb{R}}\newcommand{rank}{\text{rank}}\newcommand{\norm}[1]{\left\lVert#1\right\rVert}\newcommand{diag}{\text{diag}}\newcommand{tr}{\text{tr}}\newcommand{braket}[1]{\left\langle#1\right\rangle}\newcommand{C}{\mathbb{C}}\)

Notations

First, let's fix some standard notation used in this course. Let's assume no prior knowledge of linear algebra and start with matrix multiplication.

Matrix Multiplication

We denote a matrix \(\bs{A}\in\R^{m\times n}\), with its entries written as \([a_{ij}]_{i,j=1}^{m,n}\). Similarly, we define \(\bs{B}=[b_{jk}]_{j,k=1}^{n,p}\), and the multiplication is then defined entry-wise as \(\bs{AB} = [\sum_{j=1}^n a_{ij}b_{jk}]_{i,k=1}^{m,p}\). The product can also be represented in three other ways: column by column, \(\bs{AB}=[\bs{A}\bs{b}_1,\bs{A}\bs{b}_2,\ldots,\bs{A}\bs{b}_p]\) where \(\bs{b}_k\) are the columns of \(\bs{B}\); row by row, with the \(i\)-th row of \(\bs{AB}\) equal to the \(i\)-th row of \(\bs{A}\) times \(\bs{B}\); and as a sum of outer products, \(\bs{AB}=\sum_{j=1}^n \bs{a}_j\bs{b}^{(j)}\), where \(\bs{a}_j\) are the columns of \(\bs{A}\) and \(\bs{b}^{(j)}\) the rows of \(\bs{B}\).

A special example of such representation: let’s assume

\[ \bs{A}=[\bs{a}_1,\bs{a}_2,\ldots,\bs{a}_n]\in\R^{m\times n}\text{ and } \bs{D} = \diag(d_1,d_2,\ldots,d_n) \in\R^{n\times n},\]

then we have right away \(\bs{AD}=[\bs{a}_id_i]_{i=1}^n\).
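
As a quick numerical sanity check, here is a minimal NumPy sketch (sizes and data are made up for illustration) of the entry-wise definition and of the diagonal-scaling identity above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 3, 5
A = rng.normal(size=(m, n))
B = rng.normal(size=(n, p))

# Entry-wise definition: (AB)_{ik} = sum_j a_{ij} b_{jk}.
AB = np.array([[sum(A[i, j] * B[j, k] for j in range(n)) for k in range(p)]
               for i in range(m)])
assert np.allclose(AB, A @ B)

# Diagonal scaling: AD rescales the i-th column of A by d_i.
d = rng.normal(size=n)
assert np.allclose(A @ np.diag(d), A * d)  # broadcasting scales each column a_i by d_i
```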

Exercise When multiplying matrices we also care about their ranks. Here is a quick conclusion: if \(\bs{x}\neq \bs{0}\) and \(\bs{y}\neq \bs{0}\), then \(\rank(\bs{xy'})=1\). Conversely, if \(\rank(\bs{A})=1\), then \(\exists\ \bs{x}\neq \bs{0}, \bs{y}\neq \bs{0}\) s.t. \(\bs{xy'}=\bs{A}\). Prove it.

Norms

There are two types of norms we consider in this course: vector norms, e.g. the Euclidean 2-norm \(\norm{\bs{x}}_2=\sqrt{\bs{x}'\bs{x}}\), and matrix norms, e.g. the Frobenius norm \(\norm{\bs{A}}_F=\sqrt{\tr(\bs{A}'\bs{A})}\) and the spectral 2-norm \(\norm{\bs{A}}_2=\max_{\bs{x}\neq\bs{0}}\norm{\bs{Ax}}_2/\norm{\bs{x}}_2\).

Property What makes these norms? Each of them satisfies the three norm axioms: \(\norm{\bs{x}}\ge 0\) with equality iff \(\bs{x}=\bs{0}\) (positive definiteness), \(\norm{\alpha\bs{x}}=|\alpha|\norm{\bs{x}}\) (absolute homogeneity), and \(\norm{\bs{x}+\bs{y}}\le\norm{\bs{x}}+\norm{\bs{y}}\) (triangle inequality).

Exercise Try to prove them for Euclidean 2-norm, Frobenius norm and spectral 2-norm.
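
A minimal NumPy sketch (with arbitrary random data) that checks these axioms numerically for the three norms mentioned in the exercise:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)
A, B = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))

# Triangle inequality for the Euclidean 2-norm, Frobenius norm and spectral 2-norm.
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)
assert np.linalg.norm(A + B, 'fro') <= np.linalg.norm(A, 'fro') + np.linalg.norm(B, 'fro')
assert np.linalg.norm(A + B, 2) <= np.linalg.norm(A, 2) + np.linalg.norm(B, 2)

# Absolute homogeneity, e.g. for the spectral 2-norm.
assert np.isclose(np.linalg.norm(-2.5 * A, 2), 2.5 * np.linalg.norm(A, 2))
```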

Inner Products

There are two types of inner products we consider: the Euclidean inner product for vectors, \(\braket{\bs{x},\bs{y}}=\bs{x}'\bs{y}\), and the trace (Frobenius) inner product for matrices, \(\braket{\bs{A},\bs{B}}=\tr(\bs{A}'\bs{B})\).

A famous inequality related to these inner products is the Cauchy-Schwarz inequality, which states

\[ |\braket{\bs{x},\bs{y}}| \le \norm{\bs{x}}\cdot\norm{\bs{y}}, \]

with equality iff one of the two vectors is a scalar multiple of the other; the same bound holds for the trace inner product paired with the Frobenius norm.
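
In code, both inner products and the corresponding Cauchy-Schwarz bounds can be checked directly (a minimal sketch, assuming the vector inner product \(\bs{x}'\bs{y}\) and the trace inner product \(\tr(\bs{A}'\bs{B})\) described above):

```python
import numpy as np

rng = np.random.default_rng(2)
x, y = rng.normal(size=6), rng.normal(size=6)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))

# Cauchy-Schwarz for the vector inner product <x, y> = x'y ...
assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y)
# ... and for the trace inner product <A, B> = tr(A'B), paired with the Frobenius norm.
assert abs(np.trace(A.T @ B)) <= np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')
```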

Eigenvalue Decomposition (EVD)

The first matrix decomposition we’re gonna talk about is the eigenvalue decomposition.

Eigenvalues and Eigenvectors

For a square matrix \(\bs{A}\in\R^{n\times n}\), if \(\bs{0}\neq \bs{x}\in\C^n\) and \(\lambda\in\C\) are s.t. \(\bs{Ax} = \lambda\bs{x}\), then \(\lambda\) is called an eigenvalue of \(\bs{A}\) and \(\bs{x}\) is called a \(\lambda\)-eigenvector of \(\bs{A}\).

Ideally, we want a matrix to have \(n\) eigenvalues and \(n\) corresponding eigenvectors that are linearly independent of each other. This is not always the case.

Existence of EVD

Theorem \(\bs{A}\in\R^{n\times n}\) has \(n\) linearly independent eigenvectors iff. there exists an invertible \(\bs{X}\in\R^{n\times n}\) s.t. \(\bs{X}^{-1}\bs{A}\bs{X}=\bs{\Lambda}\) is diagonal, i.e. \(\bs{A}\) is diagonalizable. This gives \(\bs{A}=\bs{X}\bs{\Lambda}\bs{X}^{-1}\), which is called the eigenvalue decomposition (EVD).

Theorem (Spectral Theorem for Symmetric Matrices) For a symmetric matrix \(\bs{A}\in\R^{n\times n}\) there always exist an orthogonal matrix \(\bs{Q}\), namely \(\bs{Q}'\bs{Q}=\bs{I}\), and a diagonal matrix \(\bs{\Lambda}=\diag(\lambda_1,\lambda_2,\ldots,\lambda_n)\) that give

\[ \bs{A}=\bs{Q\Lambda Q}' = \sum_{i=1}^n \lambda_i \bs{q}_i \bs{q}_i' \]

where \(\bs{q}_i\) are the column vectors of \(\bs{Q}\). This is called the symmetric EVD, aka. \(\bs{A}\) being orthogonally diagonalizable.
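
A minimal NumPy sketch of the spectral theorem in action, using a random symmetric matrix (the data is made up; `np.linalg.eigh` is the routine specialized to symmetric matrices):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(5, 5))
A = (M + M.T) / 2                      # a random symmetric matrix

lam, Q = np.linalg.eigh(A)             # eigenvalues and an orthonormal Q
assert np.allclose(Q.T @ Q, np.eye(5))                                  # Q'Q = I
assert np.allclose(A, Q @ np.diag(lam) @ Q.T)                           # A = Q Lambda Q'
assert np.allclose(A, sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(5)))
```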

Properties of EVD

Property We have several properties following the second theorem above. For all \(i=1,2,\ldots, n\)

The second theorem above can also be represented as

Theorem If \(\bs{A}=\bs{A}'\), then \(\bs{A}\) has \(n\) orthogonal eigenvectors.

Singular Value Decomposition (SVD)

For general matrices, we have singular value decomposition.

Definition

The most famous form of the SVD is defined as

\[ \bs{A} = \bs{U} \bs{\Sigma} \bs{V}' \]

where \(\bs{A}\in\R^{m\times n}\), \(\bs{U}\in\R^{m\times m}\), \(\bs{\Sigma}\in\R^{m\times n}\) and \(\bs{V}\in\R^{n\times n}\). Specifically, both \(\bs{U}\) and \(\bs{V}\) are orthogonal (i.e. \(\bs{U}'\bs{U}=\bs{I}\), same for \(\bs{V}\)) and \(\bs{\Sigma}\) is diagonal. Usually, we choose the singular values to be non-increasing, namely

\[ \bs{\Sigma}=\diag(\sigma_1,\sigma_2,\ldots,\sigma_{\min\{m,n\}})\quad\text{where}\quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{m,n\}}. \]

Terminology

Here we define a list of terms that'll be used from time to time: the diagonal entries \(\sigma_i\) of \(\bs{\Sigma}\) are called the singular values of \(\bs{A}\), the columns \(\bs{u}_i\) of \(\bs{U}\) are called the left singular vectors, and the columns \(\bs{v}_i\) of \(\bs{V}\) are called the right singular vectors.

Three Forms of SVD

Besides the regular SVD given above, we have the outer product SVD:

\[ \bs{A} = \sum_{i=1}^{\min\{m,n\}}\!\!\!\sigma_i \bs{u}_i \bs{v}_i' \]

and condensed SVD:

\[ \bs{A} = \bs{U}_r\bs{\Sigma}_r\bs{V}_r' \]

where \(r=\rank(\bs{A})\) is also the number of non-zero singular values. In this form, we have \(\bs{\Sigma}_r\in\R^{r\times r}\) with correspondingly truncated \(\bs{U}_r\in\R^{m\times r}\) and \(\bs{V}_r\in\R^{n\times r}\), i.e. the first \(r\) columns of \(\bs{U}\) and \(\bs{V}\).
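
All three forms are easy to verify numerically; below is a minimal NumPy sketch on a random matrix built to have rank \(r\) (the sizes are placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 6, 4, 2
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))    # rank-r by construction

# Regular SVD: U (m x m), V (n x n), Sigma (m x n) with min(m, n) singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(A, U @ Sigma @ Vt)

# Outer product SVD: A = sum_i sigma_i u_i v_i'.
assert np.allclose(A, sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(min(m, n))))

# Condensed SVD: keep only the r non-zero singular values.
assert np.allclose(A, U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :])
```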

Existence of SVD

Theorem (Existence of SVD) Let \(\bs{A}\in\R^{m\times n}\) and \(r=\rank(\bs{A})\). Then \(\exists\ \bs{U}_r\in\R^{m\times r}\), \(\bs{V}_r\in\R^{n\times r}\) and \(\bs{\Sigma}_r\in\R^{r\times r}\) s.t. \(\bs{A} = \bs{U}_r\bs{\Sigma}_r\bs{V}_r'\), where \(\bs{U}_r\) and \(\bs{V}_r\) have orthonormal columns (\(\bs{U}_r'\bs{U}_r=\bs{V}_r'\bs{V}_r=\bs{I}_r\)) and \(\bs{\Sigma}_r\) is diagonal with positive entries. This means the condensed SVD exists, and therefore so do the other two forms.

Proof Define symmetric \(\bs{W}\in\R^{(m+n)\times(m+n)}\) as

\[ \bs{W} = \begin{bmatrix} \bs{0} & \bs{A} \\ \bs{A}' & \bs{0} \end{bmatrix} \]

which has an orthogonal EVD \(\bs{W} = \bs{Z}\bs{\Lambda}\bs{Z}'\) where \(\bs{Z}'\bs{Z}=\bs{I}\). Now, assume \(\bs{z}\in\R^{m+n}\) is an eigenvector of \(\bs{W}\) corresponding to \(\lambda\), so that \(\bs{W}\bs{z} = \lambda \bs{z}\). Denote the first \(m\) entries of \(\bs{z}\) by \(\bs{x}\) and the remaining \(n\) entries by \(\bs{y}\), which gives

\[ \begin{bmatrix} \bs{0} & \bs{A}\\ \bs{A}' & \bs{0} \end{bmatrix} \begin{bmatrix} \bs{x} \\ \bs{y} \end{bmatrix} = \lambda \begin{bmatrix} \bs{x} \\ \bs{y} \end{bmatrix} \Rightarrow \begin{cases} \bs{Ay} = \lambda \bs{x},\\ \bs{A}'\bs{x} = \lambda \bs{y}. \end{cases} \]

Using this result,

\[ \begin{bmatrix} \bs{0} & \bs{A}\\ \bs{A}' & \bs{0} \end{bmatrix} \begin{bmatrix} \bs{x} \\ -\bs{y} \end{bmatrix} = \begin{bmatrix} -\bs{Ay} \\ \bs{A}'\bs{x} \end{bmatrix} = \begin{bmatrix} -\lambda \bs{x}\\ \lambda \bs{y} \end{bmatrix} = -\lambda\begin{bmatrix} \bs{x}\\ -\bs{y} \end{bmatrix} \]

which means \(-\lambda\) is also an eigenvalue of \(\bs{W}\). Hence, we know

\[ \begin{align} \bs{W} &= \bs{Z}\bs{\Lambda}\bs{Z}' = \bs{Z}_r\bs{\Lambda}_r\bs{Z}_r'\\ &= \frac{1}{\sqrt{2}}\begin{bmatrix} \bs{X} & \bs{X}\\ \bs{Y} & -\bs{Y} \end{bmatrix} \begin{bmatrix} \bs{\Sigma} & \bs{0}\\ \bs{0} & -\bs{\Sigma} \end{bmatrix} \cdot\frac{1}{\sqrt{2}}\begin{bmatrix} \bs{X} & \bs{X}\\ \bs{Y} & -\bs{Y} \end{bmatrix}'\\ &= \begin{bmatrix} \bs{0} & \bs{X}\bs{\Sigma}\bs{Y}'\\ \bs{Y}\bs{\Sigma}\bs{X}' & \bs{0} \end{bmatrix}, \end{align} \]

where the columns of \(\bs{X}\in\R^{m\times r}\) and \(\bs{Y}\in\R^{n\times r}\) stack the vectors \(\bs{x}_i\) and \(\bs{y}_i\) corresponding to the positive eigenvalues \(\sigma_1,\ldots,\sigma_r\), and the \(1/\sqrt{2}\) factors normalize the columns of \(\bs{Z}_r\) (each \(\bs{x}_i\) and \(\bs{y}_i\) will turn out to be a unit vector).

Therefore, we conclude \(\bs{A}=\bs{X}\bs{\Sigma}\bs{Y}'\), and all that remains is to show that the columns of \(\bs{X}\) and \(\bs{Y}\) are orthonormal. Take the eigenvector \(\bs{z}=(\bs{x},\bs{y})\) defined above and scale it so that

\[ \norm{\bs{z}}^2=\bs{z}'\bs{z}=\bs{x}'\bs{x} + \bs{y}'\bs{y} = 2. \]

Since \(\bar{\bs{z}}:=(\bs{x},-\bs{y})\) is an eigenvector corresponding to the different eigenvalue \(-\lambda\), the orthogonality of eigenvectors of a symmetric matrix corresponding to different eigenvalues gives

\[ \bs{z}'\bar{\bs{z}} = \bs{x}'\bs{x} - \bs{y}'\bs{y} = 0, \]

which altogether gives \(\norm{\bs{x}}=\norm{\bs{y}}=1\). Applying the same orthogonality argument to two different eigenvectors \(\bs{z}_i=(\bs{x}_i,\bs{y}_i)\) and \(\bs{z}_j,\bar{\bs{z}}_j\) (for \(i\neq j\)) yields \(\bs{x}_i'\bs{x}_j=\bs{y}_i'\bs{y}_j=0\), so the columns of \(\bs{X}\) and \(\bs{Y}\) are indeed orthonormal. Q.E.D.
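
The construction used in the proof is also easy to check numerically: the non-zero eigenvalues of \(\bs{W}\) come in pairs \(\pm\sigma_i\). A minimal sketch with an arbitrary rectangular matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 3
A = rng.normal(size=(m, n))

# W = [[0, A], [A', 0]] is symmetric with eigenvalues +/- sigma_i (plus |m - n| zeros).
W = np.block([[np.zeros((m, m)), A], [A.T, np.zeros((n, n))]])
eigvals = np.sort(np.linalg.eigvalsh(W))
sigma = np.linalg.svd(A, compute_uv=False)

expected = np.sort(np.concatenate([sigma, -sigma, np.zeros(m + n - 2 * len(sigma))]))
assert np.allclose(eigvals, expected)
```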

Properties of SVD

Property There are several characteristics of the SVD worth keeping in mind: \(\rank(\bs{A})\) equals the number of non-zero singular values, \(\norm{\bs{A}}_F=\sqrt{\sigma_1^2+\cdots+\sigma_r^2}\), and \(\norm{\bs{A}}_2=\sigma_1\).

Exercise Show how to use SVD to calculate these two norms.
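
As a hint for the exercise, the standard identities \(\norm{\bs{A}}_F=\sqrt{\sum_i\sigma_i^2}\) and \(\norm{\bs{A}}_2=\sigma_1\) can be confirmed numerically (a minimal sketch on random data):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(5, 3))
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, non-increasing

assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(sigma ** 2)))  # Frobenius norm
assert np.isclose(np.linalg.norm(A, 2), sigma[0])                         # spectral 2-norm
assert np.linalg.matrix_rank(A) == np.sum(sigma > 1e-10)                  # rank = #{sigma_i > 0}
```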

Applications of SVD

  1. One of the most important uses of SVD is computing projections (a numerical sketch covering this and several of the later applications appears after this list). \(\bs{P}\in\R^{n\times n}\) is a projection matrix iff. \(\bs{P}^2=\bs{P}\). More commonly, we consider an orthogonal projection \(\bs{P}\) that is also symmetric. Now let's consider a dataset \(\bs{A}\in\R^{m\times n}\) where we have \(n\) observations, each with \(m\) dimensions. Suppose we want to project this dataset onto a \(k\)-dimensional subspace \(\bs{W}\subseteq\R^m\), i.e.

    \[ \bs{W} = \span\{\bs{q}_1,\bs{q}_2,\ldots,\bs{q}_k\},\quad \bs{q}_i'\bs{q}_j=\1{i=j} \]

    then the projection matrix would be \(\bs{P}_{\bs{W}}=\bs{Q}_k\bs{Q}_k'\), where \(\bs{Q}_k=[\bs{q}_1,\bs{q}_2,\ldots,\bs{q}_k]\).

  2. The nearest orthogonal matrix to \(\bs{A}\in\R^{p\times p}\) is the solution to

    \[ \min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}-\bs{X}}_F \]

    which is solved once we find the optimum of

    \[ \begin{align} \min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}-\bs{X}}_F^2 &= \min_{\bs{X}'\bs{X}=\bs{I}}\tr[(\bs{A}-\bs{X})'(\bs{A}-\bs{X})]\\&= \min_{\bs{X}'\bs{X}=\bs{I}}\tr[\bs{A}'\bs{A} - \bs{X}'\bs{A} - \bs{A}'\bs{X} + \bs{X}'\bs{X}]\\&= \min_{\bs{X}'\bs{X}=\bs{I}} \norm{\bs{A}}_F^2 - \tr(\bs{A}'\bs{X}) - \tr(\bs{X}'\bs{A}) + \tr(\bs{X}'\bs{X})\\&= \norm{\bs{A}}_F^2 + p - 2\max_{\bs{X}'\bs{X}=\bs{I}} \tr(\bs{A}'\bs{X}). \end{align} \]

    Now we try to solve

    \[ \max_{\bs{X}'\bs{X}=\bs{I}} \tr(\bs{A}'\bs{X}) \]

    and claim the solution is given by \(\bs{X} = \bs{U}\bs{V}'\) where \(\bs{U}\) and \(\bs{V}\) are derived from SVD of \(\bs{A}\), namely \(\bs{A} = \bs{U\Sigma V}'\). Proof: We know

    \[ \tr(\bs{A}'\bs{X}) = \tr(\bs{V}\bs{\Sigma}'\bs{U}'\bs{X}) = \tr(\bs{\Sigma}'\bs{U}'\bs{X}\bs{V}) =: \tr(\bs{\Sigma}'\bs{Z}) \]

    where we define \(\bs{Z}\) as the product of the three orthogonal matrices, which therefore is orthogonal: \(\bs{Z}'\bs{Z}=\bs{I}\).

    Orthogonality of \(\bs{Z}\) gives \(\forall i\)

    \[ z_{i1}^2 + z_{i2}^2 + \cdots + z_{ip}^2 = 1 \Rightarrow z_{ii} \le 1. \]

    Hence, (note all singular values are non-negative)

    \[ \tr(\bs{\Sigma}'\bs{Z}) = \sum_{i=1}^p \sigma_i z_{ii} \le \sum_{i=1}^p \sigma_i \]

    which gives optimal \(\bs{Z}^*=\bs{I}\) and thus the solution follows.

  3. The orthogonal Procrustes problem seeks the solution to

    \[ \min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}-\bs{BX}}_F \]

    which is, similar to the problem above, given by the SVD of \(\bs{B}'\bs{A}=\bs{U\Sigma V}'\), namely \(\bs{X}=\bs{UV}'\).

  4. The nearest symmetric matrix for \(\bs{A}\in\R^{p\times p}\) seeks the solution to \(\min_{\bs{X}=\bs{X}'}\norm{\bs{A}-\bs{X}}_F\), which is simply

    \[ \bs{X} = \frac{\bs{A}+\bs{A}'}{2}. \]

    In order to prove it, write \(\bs{A}\) in the form of

    \[ \bs{A} = \frac{\bs{A} + \bs{A}'}{2} + \frac{\bs{A} - \bs{A}'}{2} =: \bs{X} + \bs{Y}. \]

    Notice \(\tr(\bs{X}'\bs{Y})=0\); in fact, for any symmetric \(\bs{S}\) the difference \(\bs{A}-\bs{S}=(\bs{X}-\bs{S})+\bs{Y}\) splits into a symmetric part and the fixed antisymmetric part \(\bs{Y}\), and the two parts are orthogonal under the trace inner product. Hence, by Pythagoras, \(\norm{\bs{A}-\bs{S}}_F^2=\norm{\bs{X}-\bs{S}}_F^2+\norm{\bs{Y}}_F^2\ge\norm{\bs{Y}}_F^2\), with equality iff \(\bs{S}=\bs{X}\).

  5. Best rank-\(r\) approximation. In order to find the best rank-\(r\) approximation in Frobenius norm, we need solution to

    \[ \min_{\rank(\bs{X})\le r} \norm{\bs{A}-\bs{X}}_F \]

    which is simply the truncated SVD \(\bs{X}=\bs{U}_r\bs{\Sigma}_r\bs{V}_r'\) built from the \(r\) largest singular values; see the condensed SVD above for the notation.

    The best approximation in 2-norm, namely solution to

    \[ \min_{\rank(\bs{X})\le r} \norm{\bs{A}-\bs{X}}_2, \]

    is exactly identical to the one above. We may prove both by contradiction. Proof: Suppose \(\exists\ \bs{B}\in\R^{m\times n}\) with \(\rank(\bs{B})\le r\) s.t.

    \[ \norm{\bs{A}-\bs{B}}_2 < \norm{\bs{A}-\bs{X}}_2 = \sigma_{r+1}. \]

    Now choose a non-zero \(\bs{w}\) from the intersection of the kernel of \(\bs{B}\) and \(\span\{\bs{v}_1,\ldots,\bs{v}_{r+1}\}\); this intersection is non-trivial since \(\dim\ker(\bs{B})\ge n-r\). Then we have

    \[ \bs{Aw}=\bs{Aw}-\bs{Bw} = (\bs{A}-\bs{B})\bs{w} \]

    and thus

    \[ \norm{\bs{Aw}}_2 = \norm{(\bs{A}-\bs{B})\bs{w}}_2 \le \norm{\bs{A}-\bs{B}}_2\cdot \norm{\bs{w}}_2 <\sigma_{r+1}\norm{\bs{w}}_2\tag{1}. \]

    Meanwhile, since \(\bs{w}\in\span\{\bs{v}_1,\bs{v}_2,\ldots,\bs{v}_{r+1}\}\), we can write \(\bs{w}=\bs{V}_{r+1}\bs{\alpha}\) for some \(\bs{\alpha}\in\R^{r+1}\), where \(\bs{V}_{r+1}=[\bs{v}_1,\ldots,\bs{v}_{r+1}]\). Then

    \[ \begin{align} \norm{\bs{Aw}}_2^2&=\norm{\bs{U}\bs{\Sigma}\bs{V}'\bs{w}}_2^2 = \sum_{i=1}^{r+1}\sigma_i^2\alpha_i^2 \ge \sigma_{r+1}^2\sum_{i=1}^{r+1}\alpha_i^2\\ &= \sigma_{r+1}^2\norm{\bs{\alpha}}_2^2\equiv \sigma_{r+1}^2\norm{\bs{w}}_2^2.\tag{2} \end{align} \]

    Due to the contradiction between eq. (1) and (2) we conclude that such a \(\bs{B}\) doesn't exist. Q.E.D.
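
Finally, here is a hedged NumPy sketch tying several of the applications above together: projection onto a \(k\)-dimensional subspace (item 1), the nearest orthogonal matrix (item 2), orthogonal Procrustes (item 3), and the best rank-\(r\) approximation (item 5). All matrices are random placeholders, so this is an illustration rather than anything taken from the course itself.

```python
import numpy as np

rng = np.random.default_rng(7)

# 1. Orthogonal projection onto W = span{q_1, ..., q_k}: P = Q_k Q_k'.
m, k = 6, 2
Qk, _ = np.linalg.qr(rng.normal(size=(m, k)))          # orthonormal basis of a random subspace
P = Qk @ Qk.T
assert np.allclose(P @ P, P) and np.allclose(P, P.T)   # idempotent and symmetric

# 2. Nearest orthogonal matrix: X = UV' from the SVD of A.
A = rng.normal(size=(4, 4))
U, s, Vt = np.linalg.svd(A)
X = U @ Vt
assert np.allclose(X.T @ X, np.eye(4))

# 3. Orthogonal Procrustes: minimize ||A - BX||_F over orthogonal X via the SVD of B'A.
A, B = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
U, s, Vt = np.linalg.svd(B.T @ A)
X = U @ Vt
Z, _ = np.linalg.qr(rng.normal(size=(4, 4)))           # any other orthogonal matrix ...
assert np.linalg.norm(A - B @ X) <= np.linalg.norm(A - B @ Z) + 1e-12   # ... does no better

# 5. Best rank-r approximation (truncated SVD).
A = rng.normal(size=(6, 5))
U, s, Vt = np.linalg.svd(A)
r = 2
Ar = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
assert np.isclose(np.linalg.norm(A - Ar, 2), s[r])                             # error = sigma_{r+1}
assert np.isclose(np.linalg.norm(A - Ar, 'fro'), np.sqrt(np.sum(s[r:] ** 2)))  # Frobenius error
```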