allenfrostline

Notes on Multivariate Data Analysis via Matrix Decomposition

2019-09-30


These are the notes taken in my master’s course Multivariate Data Analysis via Matrix Decomposition. If you’re confused by the course name, you can think of this as a statistical course on unsupervised learning. \(\newcommand{1}[1]{\unicode{x1D7D9}_{\{#1\}}} \newcommand{Corr}{\text{Corr}} \newcommand{E}{\text{E}} \newcommand{Cov}{\text{Cov}} \newcommand{Var}{\text{Var}} \newcommand{span}{\text{span}} \newcommand{bs}{\boldsymbol} \newcommand{R}{\mathbb{R}} \newcommand{rank}{\text{rank}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{diag}{\text{diag}} \newcommand{tr}{\text{tr}} \newcommand{braket}[1]{\left\langle#1\right\rangle} \newcommand{C}{\mathbb{C}}\)

Notations

First let’s give some standard notations used in this course. Let’s assume no prior knowledge of linear algebra and start from matrix multiplication.

Matrix Multiplication

We denote a matrix \(\bs{A}\in\R^{m\times n}\), with its entries defined as \([a_{ij}]_{i,j=1}^{m,n}\). Similarly, we define \(\bs{B}=[b_{jk}]_{j,k=1}^{n,p}\) and thus the multiplication is defined as \(\bs{AB} = [\sum_{j=1}^n a_{ij}b_{jk}]_{i,k=1}^{m,p}\), which can also be represented in three other ways:

A special example of such representation: let’s assume

\[ \bs{A}=[\bs{a}_1,\bs{a}_2,\ldots,\bs{a}_n]\in\R^{m\times n}\text{ and } \bs{D} = \diag(d_1,d_2,\ldots,d_n) \in\R^{n\times n},\]

then we have right away \(\bs{AD}=[\bs{a}_id_i]_{i=1}^n\).

Exercise With multiplication we care about the ranks of matrices. There is a quick conclusion: if \(\bs{x}\neq \bs{0}, \bs{y}\neq \bs{0}\), then \(\rank(\bs{yx'})=1\). Conversely, if \(\rank(\bs{A})=1\), then \(\exists\ \bs{x}\neq \bs{0}, \bs{y}\neq \bs{0}\) s.t. \(\bs{xy'}=\bs{A}\). Prove it.
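
A quick numerical illustration (not a proof) in NumPy, checking that the outer product of two nonzero vectors indeed has rank one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)          # a nonzero vector in R^5
y = rng.standard_normal(3)          # a nonzero vector in R^3

A = np.outer(y, x)                  # y x' is a 3x5 matrix
print(np.linalg.matrix_rank(A))     # 1
```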

Norms

There are two types of norms we consider in this course:

Properties of these norms:

Exercise Try to prove them for the Euclidean 2-norm, the Frobenius norm and the spectral 2-norm.

Inner Products

There are two types of inner products we consider:

A famous inequality related to these inner products is the Cauchy-Schwarz inequality, which states

\[ |\braket{\bs{x},\bs{y}}| \le \norm{\bs{x}}\cdot\norm{\bs{y}}. \]

Eigenvalue Decomposition (EVD)

The first matrix decomposition we’re gonna talk about is the eigenvalue decomposition.

Eigenvalues and Eigenvectors

For a square matrix \(\bs{A}\in\R^{n\times n}\), if \(\bs{0}\neq \bs{x}\in\C^n\) and \(\lambda\in\C\) are s.t. \(\bs{Ax} = \lambda\bs{x}\), then \(\lambda\) is called an eigenvalue of \(\bs{A}\) and \(\bs{x}\) is called a \(\lambda\)-eigenvector of \(\bs{A}\).

Ideally, we want a matrix to have \(n\) eigenvalues and \(n\) corresponding eigenvectors that are linearly independent of each other. This is not always the case.

Existence of EVD

Theorem \(\bs{A}\in\R^{n\times n}\) has \(n\) linearly independent eigenvectors iff. there exists an invertible \(\bs{X}\in\R^{n\times n}\) s.t. \(\bs{X}^{-1}\bs{A}\bs{X}=\bs{\Lambda}\), i.e. \(\bs{A}\) is diagonalizable. This gives \(\bs{A}=\bs{X}\bs{\Lambda}\bs{X}^{-1}\), which is called the eigenvalue decomposition (EVD).

Theorem (Spectral Theorem for Symmetric Matrices) For symmetric matrix \(\bs{A}\in\R^{n\times n}\) there always exists an orthogonal matrix \(\bs{Q}\), namely \(\bs{Q}'\bs{Q}=\bs{I}\), that gives

\[ \bs{A}=\bs{Q\Lambda Q}' = \sum_{i=1}^n \lambda_i \bs{q}_i \bs{q}_i' \]

where \(\bs{q}_i\) are the column vectors of \(\bs{Q}\). This is called the symmetric EVD, aka. \(\bs{A}\) being orthogonally diagonalizable.
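
A minimal NumPy sketch of the symmetric EVD, using `numpy.linalg.eigh` (which assumes a symmetric input) and reconstructing \(\bs{A}\) both as \(\bs{Q\Lambda Q}'\) and as a sum of rank-one terms:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                          # a symmetric matrix

lam, Q = np.linalg.eigh(A)                 # EVD for symmetric matrices, Q'Q = I
A1 = Q @ np.diag(lam) @ Q.T                # Q Lambda Q'
A2 = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(4))   # sum of lambda_i q_i q_i'
print(np.allclose(A, A1), np.allclose(A, A2))                     # True True
```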

Properties of EVD

We have several properties that follow from the second theorem above. For all \(i=1,2,\ldots, n\):

The second theorem above can also be represented as

Theorem If \(\bs{A}=\bs{A}'\), then \(\bs{A}\) has \(n\) orthogonal eigenvectors.

Singular Value Decomposition (SVD)

For general matrices, we have singular value decomposition.

Definition

The most famous form of SVD is defined as

\[ \bs{A} = \bs{U} \bs{\Sigma} \bs{V}' \]

where \(\bs{A}\in\R^{m\times n}\), \(\bs{U}\in\R^{m\times m}\), \(\bs{\Sigma}\in\R^{m\times n}\) and \(\bs{V}\in\R^{n\times n}\). Specifically, both \(\bs{U}\) and \(\bs{V}\) are orthogonal (i.e. \(\bs{U}'\bs{U}=\bs{I}\), same for \(\bs{V}\)) and \(\bs{\Sigma}\) is diagonal. Usually, we order the singular values to be non-increasing, namely

\[ \bs{\Sigma}=\diag(\sigma_1,\sigma_2,\ldots,\sigma_{\min\{m,n\}})\quad\text{where}\quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{m,n\}}. \]

Terminology

Here we define a list of terms that’ll be used from time to time:

Three Forms of SVD

Besides the regular SVD given above, we have the outer product SVD:

\[ \bs{A} = \sum_{i=1}^{\min\{m,n\}}\!\!\!\sigma_i \bs{u}_i \bs{v}_i' \]

and condensed SVD:

\[ \bs{A} = \bs{U}_r\bs{\Sigma}_r\bs{V}_r' \]

where \(r=\rank(\bs{A})\) is also the number of non-zero singular values. In this form, we have \(\bs{\Sigma}_r\in\R^{r\times r}\) with correspondingly truncated \(\bs{U}_r\in\R^{m\times r}\) and \(\bs{V}_r\in\R^{n\times r}\).
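
For illustration, here’s how the three forms look in NumPy (a sketch; for a random Gaussian matrix the rank \(r\) equals \(\min\{m,n\}\) almost surely):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
r = np.linalg.matrix_rank(A)

# Regular SVD: U is 5x5, s holds the singular values, Vt is 3x3
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))                                    # True

# Outer product SVD: sum of sigma_i u_i v_i'
A_outer = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(min(A.shape)))
print(np.allclose(A, A_outer))                                           # True

# Condensed SVD: keep only the r non-zero singular values
Ur, sr, Vtr = U[:, :r], s[:r], Vt[:r, :]
print(np.allclose(A, Ur @ np.diag(sr) @ Vtr))                            # True
```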

Existence of SVD

Theorem (Existence of SVD) Let \(\bs{A}\in\R^{m\times n}\) and \(r=\rank(\bs{A})\). Then \(\exists\ \bs{U}_r\in\R^{m\times r}\), \(\bs{V}_r\in\R^{n\times r}\) and \(\bs{\Sigma}_r\in\R^{r\times r}\) s.t. \(\bs{A} = \bs{U}_r\bs{\Sigma}_r\bs{V}_r'\) where \(\bs{U}_r\) and \(\bs{V}_r\) have orthonormal columns and \(\bs{\Sigma}_r\) is diagonal. This means the condensed SVD exists, and therefore so do the other two forms.

Proof. Define symmetric \(\bs{W}\in\R^{(m+n)\times(m+n)}\) as

\[ \bs{W} = \begin{bmatrix} \bs{0} & \bs{A} \\ \bs{A}' & \bs{0} \end{bmatrix} \]

which has an orthogonal EVD as \(\bs{W} = \bs{Z}\bs{\Lambda}\bs{Z}'\) where \(\bs{Z}'\bs{Z}=\bs{I}\). Now, assume \(\bs{z}\in\R^{m+n}\) is an eigenvector of \(\bs{W}\) corresponding to \(\lambda\), then \(\bs{W}\bs{z} = \lambda \bs{z}\). Denote the first \(m\) entries of \(\bs{z}\) as \(\bs{x}\) and the rest \(\bs{y}\), which gives

\[ \begin{bmatrix} \bs{0} & \bs{A}\\ \bs{A}' & \bs{0} \end{bmatrix} \begin{bmatrix} \bs{x} \\ \bs{y} \end{bmatrix} = \lambda \begin{bmatrix} \bs{x} \\ \bs{y} \end{bmatrix} \Rightarrow \begin{cases} \bs{Ay} = \lambda \bs{x},\\ \bs{A}'\bs{x} = \lambda \bs{y}. \end{cases} \]

Using these results,

\[ \begin{bmatrix} \bs{0} & \bs{A}\\ \bs{A}' & \bs{0} \end{bmatrix} \begin{bmatrix} \bs{x} \\ -\bs{y} \end{bmatrix} = \begin{bmatrix} -\bs{Ay} \\ \bs{A}'\bs{x} \end{bmatrix} = \begin{bmatrix} -\lambda \bs{x}\\ \lambda \bs{y} \end{bmatrix} = -\lambda\begin{bmatrix} \bs{x}\\ -\bs{y} \end{bmatrix} \]

which means \(-\lambda\) is also an eigenvalue of \(\bs{W}\). Hence, we know

\[ \begin{align} \bs{W} &= \bs{Z}\bs{\Lambda}\bs{Z}' = \bs{Z}_r\bs{\Lambda}_r\bs{Z}_r'\\ &= \begin{bmatrix} \bs{X} & \bs{X}\\ \bs{Y} & -\bs{Y} \end{bmatrix} \begin{bmatrix} \bs{\Sigma} & \bs{0}\\ \bs{0} & -\bs{\Sigma} \end{bmatrix} \begin{bmatrix} \bs{X} & \bs{X}\\ \bs{Y} & -\bs{Y} \end{bmatrix}'\\ &= \begin{bmatrix} \bs{0} & \bs{X}\bs{\Sigma}\bs{Y}'\\ \bs{Y}\bs{\Sigma}\bs{X}' & \bs{0} \end{bmatrix}. \end{align} \]

Therefore, we conclude \(\bs{A}=\bs{X}\bs{\Sigma}\bs{Y}'\), where now all we need to prove is the orthogonality of \(\bs{X}\) and \(\bs{Y}\). Let’s take a look at the \(\bs{z}=(\bs{x},\bs{y})\) we just defined, scaled so that

\[ \norm{\bs{z}}^2=\bs{z}'\bs{z}=\bs{x}'\bs{x} + \bs{y}'\bs{y} = 2. \]

From orthogonality of eigenvectors corresponding to different eigenvalues, we also know

\[ \bs{z}'\bar{\bs{z}} = \bs{x}'\bs{x} - \bs{y}'\bs{y} = 0 \]

which altogether gives \(\norm{\bs{x}}=\norm{\bs{y}}=1\). Q.E.D.

How to Calculate SVD

Three steps to calculate the SVD of \(\bs{A}\in\R^{m\times n}\):

Remark: Alternatively you may use the formula \(\bs{U}=\bs{AV\Sigma}^{-1}\Rightarrow \bs{u}_i=\bs{Av}_i/\sigma_i\).
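
A sketch of this recipe in NumPy, assuming \(\bs{A}\) has full column rank so that every \(\sigma_i>0\) (in practice one would call a library SVD instead of forming \(\bs{A}'\bs{A}\) explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))            # assumed to have full column rank here

# 1) EVD of A'A gives V and the squared singular values
lam, V = np.linalg.eigh(A.T @ A)           # ascending eigenvalues
idx = np.argsort(lam)[::-1]                # reorder to non-increasing
lam, V = lam[idx], V[:, idx]
sigma = np.sqrt(np.clip(lam, 0, None))     # singular values

# 2) Recover U column by column via u_i = A v_i / sigma_i
U = A @ V / sigma                          # broadcasting divides each column by sigma_i

print(np.allclose(A, U @ np.diag(sigma) @ V.T))   # True
```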

Properties of SVD

There are several characteristics we have about SVD:

Exercise Show how to use SVD to calculate these two norms.

Applications of SVD

1) Projections. One of the most important uses of SVD is computing projections. \(\bs{P}\in\R^{n\times n}\) is a projection matrix iff. \(\bs{P}^2=\bs{P}\). More commonly, we consider orthogonal projections, where \(\bs{P}\) is also symmetric. Now let’s consider a dataset \(\bs{A}\in\R^{m\times n}\) where we have \(n\) observations, each with \(m\) dimensions. Suppose we want to project this dataset onto a \(k\)-dimensional subspace \(\bs{W}\subseteq\R^m\), i.e.

\[ \bs{W} = \span\{\bs{q}_1,\bs{q}_2,\ldots,\bs{q}_k\},\quad \bs{q}_i'\bs{q}_j=\1{i=j} \]

then the projection matrix would be \(\bs{P}_{\bs{W}}=\bs{Q}_k\bs{Q}_k'\).
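
A small NumPy sketch, where the orthonormal basis \(\bs{Q}_k\) is taken, as one possible choice, from the leading left singular vectors of \(\bs{A}\):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 5, 20, 2
A = rng.standard_normal((m, n))            # n observations as columns, each in R^m

Qk, _, _ = np.linalg.svd(A, full_matrices=False)
Qk = Qk[:, :k]                             # orthonormal basis of a k-dimensional subspace W

P = Qk @ Qk.T                              # orthogonal projector onto W
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # True True (idempotent and symmetric)

A_proj = P @ A                             # every observation projected onto W
```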

2) Nearest Orthogonal Matrix. The nearest orthogonal matrix to \(\bs{A}\in\R^{p\times p}\) is the solution to

\[ \min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}-\bs{X}}_F \]

which is solved once we find the optimum of

\[ \begin{align} \min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}-\bs{X}}_F^2 &= \min_{\bs{X}'\bs{X}=\bs{I}}\tr[(\bs{A}-\bs{X})'(\bs{A}-\bs{X})]\\&= \min_{\bs{X}'\bs{X}=\bs{I}}\tr[\bs{A}'\bs{A} - \bs{X}'\bs{A} - \bs{A}'\bs{X} + \bs{X}'\bs{X}]\\&= \min_{\bs{X}'\bs{X}=\bs{I}} \norm{\bs{A}}_F^2 - \tr(\bs{A}'\bs{X}) - \tr(\bs{X}'\bs{A}) + \tr(\bs{X}'\bs{X})\\&= \norm{\bs{A}}_F^2 + p - 2\max_{\bs{X}'\bs{X}=\bs{I}} \tr(\bs{A}'\bs{X}). \end{align} \]

Now we try to solve

\[ \max_{\bs{X}'\bs{X}=\bs{I}} \tr(\bs{A}'\bs{X}) \]

and claim the solution is given by \(\bs{X} = \bs{U}\bs{V}'\) where \(\bs{U}\) and \(\bs{V}\) are derived from SVD of \(\bs{A}\), namely \(\bs{A} = \bs{U\Sigma V}'\). Proof: We know

\[ \tr(\bs{A}'\bs{X}) = \tr(\bs{V}\bs{\Sigma}'\bs{U}'\bs{X}) = \tr(\bs{\Sigma}'\bs{U}'\bs{X}\bs{V}) =: \tr(\bs{\Sigma}'\bs{Z}) \]

where we define \(\bs{Z}\) as the product of the three orthogonal matrices, which therefore is orthogonal: \(\bs{Z}'\bs{Z}=\bs{I}\).

The orthogonality of \(\bs{Z}\) gives, for all \(i\),

\[ z_{i1}^2 + z_{i2}^2 + \cdots + z_{ip}^2 = 1 \Rightarrow z_{ii} \le 1. \]

Hence, (note all singular values are non-negative)

\[ \tr(\bs{\Sigma}'\bs{Z}) = \sum_{i=1}^p \sigma_i z_{ii} \le \sum_{i=1}^p \sigma_i \]

which gives optimal \(\bs{Z}^*=\bs{I}\) and thus the solution follows.
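
A minimal sketch of this solution in NumPy (the polar-factor construction \(\bs{X}=\bs{UV}'\) derived above):

```python
import numpy as np

def nearest_orthogonal(A):
    """Nearest orthogonal matrix to A in Frobenius norm: X = U V' with A = U Sigma V'."""
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
X = nearest_orthogonal(A)
print(np.allclose(X.T @ X, np.eye(4)))     # True: X is indeed orthogonal
```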

3) Orthogonal Procrustes Problem. This seeks the solution to

\[ \min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}-\bs{BX}}_F \]

which is, similarly to the problem above, given by the SVD of \(\bs{B}'\bs{A}=\bs{U\Sigma V}'\), namely \(\bs{X}=\bs{UV}'\).
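
A sketch of this in NumPy (the helper name `procrustes` and the toy data are mine):

```python
import numpy as np

def procrustes(A, B):
    """Orthogonal X minimizing ||A - B X||_F, via the SVD of B'A as above."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    return U @ Vt

rng = np.random.default_rng(6)
B = rng.standard_normal((10, 3))
R_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random orthogonal matrix
A = B @ R_true                                          # so the optimum should recover R_true
print(np.allclose(procrustes(A, B), R_true))            # True
```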

4) Nearest Symmetric Matrix. For \(\bs{A}\in\R^{p\times p}\) this seeks the solution to \(\min_{\bs{X}=\bs{X}'}\norm{\bs{A}-\bs{X}}_F\), which is simply

\[ \bs{X} = \frac{\bs{A}+\bs{A}'}{2}. \]

In order to prove it, write \(\bs{A}\) in the form of

\[ \bs{A} = \frac{\bs{A} + \bs{A}'}{2} + \frac{\bs{A} - \bs{A}'}{2} =: \bs{X} + \bs{Y}. \]

Notice \(\tr(\bs{X}'\bs{Y})=0\), hence by Pythagoras the antisymmetric part \(\bs{Y}\) is the smallest residual achievable, i.e. \(\bs{X}\) solves the problem above.

5) Best Rank-\(r\) Approximation. In order to find the best rank-\(r\) approximation in Frobenius norm, we need the solution to

\[ \min_{\rank(\bs{X})\le r} \norm{\bs{A}-\bs{X}}_F \]

which is merely \(\bs{X}=\bs{U}_r\bs{\Sigma}_r\bs{V}_r'\). See condensed SVD above for notation.

The best approximation in 2-norm, namely solution to

\[ \min_{\rank(\bs{X})\le r} \norm{\bs{A}-\bs{X}}_2, \]

is exactly identical to the one above. We may prove both by contradiction. Proof: Suppose \(\exists\ \bs{B}\in\R^{m\times n}\) with \(\rank(\bs{B})\le r\) s.t.

\[ \norm{\bs{A}-\bs{B}}_2 < \norm{\bs{A}-\bs{X}}_2 = \sigma_{r+1}. \]

Now choose a non-zero \(\bs{w}\) from \(N(\bs{B})\cap\span\{\bs{v}_1,\ldots,\bs{v}_{r+1}\}\), which is non-trivial since \(\dim N(\bs{B})\ge n-r\) while \(\dim\span\{\bs{v}_1,\ldots,\bs{v}_{r+1}\}=r+1\), and we have

\[ \bs{Aw}=\bs{Aw}+\bs{0} = (\bs{A}-\bs{B})\bs{w} \]

and thus

\[ \norm{\bs{Aw}}_2 = \norm{(\bs{A}-\bs{B})\bs{w}}_2 \le \norm{\bs{A}-\bs{B}}_2\cdot \norm{\bs{w}}_2 <\sigma_{r+1}\norm{\bs{w}}_2\tag{1}. \]

Meanwhile, since \(\bs{w}\in\span\{\bs{v}_1,\bs{v}_2,\ldots,\bs{v}_{r+1}\}\), we may write \(\bs{w}=\bs{V}_{r+1}\bs{\alpha}\) with \(\bs{V}_{r+1}=[\bs{v}_1,\ldots,\bs{v}_{r+1}]\), and then

\[ \begin{align} \norm{\bs{Aw}}_2^2&=\norm{\bs{U}\bs{\Sigma}\bs{V}'\bs{w}}_2^2 = \sum_{i=1}^{r+1}\sigma_i^2\alpha_i^2 \ge \sigma_{r+1}^2\sum_{i=1}^{r+1}\alpha_i^2\\ &= \sigma_{r+1}^2\norm{\bs{\alpha}}_2^2\equiv \sigma_{r+1}^2\norm{\bs{w}}_2^2.\tag{2} \end{align} \]

Due to contradiction between eq. (1) and (2) we conclude such \(\bs{B}\) doesn’t exist.
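
A quick NumPy sketch of the truncated-SVD approximation, checking the error in both norms (\(\sigma_{r+1}\) in 2-norm and \(\sqrt{\sum_{i>r}\sigma_i^2}\) in Frobenius norm):

```python
import numpy as np

def best_rank_r(A, r):
    """Best rank-r approximation of A, optimal in both Frobenius and 2-norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(7)
A = rng.standard_normal((8, 5))
r = 2
X = best_rank_r(A, r)
s = np.linalg.svd(A, compute_uv=False)
print(np.isclose(np.linalg.norm(A - X, 2), s[r]))                            # 2-norm error = sigma_{r+1}
print(np.isclose(np.linalg.norm(A - X, 'fro'), np.sqrt((s[r:] ** 2).sum()))) # Frobenius error
```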

Orthonormal Bases for Four Subspaces using SVD

SVD can be used to get orthonormal bases for each of the four subspaces: the column space \(C(\bs{A})\), the null space \(N(\bs{A})\), the row space \(C(\bs{A}')\), and the left null space \(N(\bs{A}')\).

See this post for detailed proof.
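
As a sketch, one way to extract all four bases from the full SVD in NumPy (the tolerance `tol` for the numerical rank is an arbitrary choice of mine):

```python
import numpy as np

def four_subspaces(A, tol=1e-10):
    """Orthonormal bases for C(A), N(A), C(A') and N(A') from the full SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    r = int((s > tol).sum())            # numerical rank
    return {
        'col_space':  U[:, :r],         # C(A):  leading r left singular vectors
        'left_null':  U[:, r:],         # N(A'): remaining left singular vectors
        'row_space':  Vt[:r, :].T,      # C(A'): leading r right singular vectors
        'null_space': Vt[r:, :].T,      # N(A):  remaining right singular vectors
    }

A = np.array([[1., 2., 3.],
              [2., 4., 6.]])            # a rank-1 matrix
bases = four_subspaces(A)
print(np.allclose(A @ bases['null_space'], 0))     # A kills the null-space basis
print(np.allclose(bases['left_null'].T @ A, 0))    # left null space is orthogonal to C(A)
```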

Principal Component Analysis (PCA)

PCA is the most important matrix analysis tool. In this section we use \(\bs{X}=(X_1,X_2,\ldots,X_p)\) to denote a vector of random variables. Being capitalized means they are random variables rather than observations, and a capitalized bold symbol still stands for a vector (of random variables).

Three Basic Formulas (for Population Analysis)

Expectation:

\[ \E[\bs{AX}] = \bs{A}\E[\bs{X}]. \]

Variance:

\[ \Var[\bs{AX}] = \bs{A}\Var[\bs{X}]\bs{A}'. \]

Covariance:

\[ \Cov[\bs{a}'\bs{X},\bs{b}'\bs{X}] = \bs{a}'\Var[\bs{X}]\bs{b}. \]

Definition of Population PCA

Given \(\bs{X}:\Omega\to\R^p\), find \(\bs{A}\in\R^{p\times p}\) s.t.

These two conditions can be described in equations as below:

We choose \(Y_1\) to maximize its variance, then \(Y_2\) to maximize its variance subject to being uncorrelated with \(Y_1\), and so on.

These \(Y_1,Y_2,\ldots,Y_p\) are called the principal components of \(\bs{X}\).

How to Calculate Population PCs

We can do it recursively:

Or, we can do it analytically by the theorem below:

Theorem Let \(\bs{\Sigma}=\Cov[\bs{X}]\) and let the EVD be \(\bs{\Sigma} = \bs{A}\bs{\Lambda}\bs{A}'\), then it can be proved that \(Y_k = \bs{a}_k'\bs{X}\) is the \(k\)-th PC.

Properties of Population PCs

Definition of Sample PCA

Given \(n\) samples: \(\{\bs{X}(\omega_1), \bs{X}(\omega_2), \ldots, \bs{X}(\omega_n)\}\). Everything is just the same except being in sample notations, e.g. \(\bar{\bs{X}}=\bs{X}'\bs{1} / n\) and \(\bs{S}=(\bs{X}-\bs{1}\bar{\bs{X}}')'(\bs{X}-\bs{1}\bar{\bs{X}}') / (n-1)\).

How to Calculate Sample PCA

In order to avoid loss of precision in calculating \(\bs{S}\), we do an SVD instead of an EVD, on the mean-centered sample matrix \(\bs{X}-\bs{1}\bar{\bs{X}}'=\bs{U\Sigma V}'\in\R^{n\times p}\). Then it can be proved that the \(k\)-th sample PC is given by \(\bs{v}_k\), \(k=1,2,\ldots,p\).

Remarks: In this case, we call \(\bs{V}\) the loading matrix, and \(\bs{U\Sigma}=:\bs{T}\) the score matrix. The PCA scatter plot is thereby the projection onto the PCs, i.e. the columns of \(\bs{V}\). Specifically, the coordinates are \(\{(t_{ij}, t_{ik})\in\R^2: i=1,2,\ldots, n\}\), namely two selected columns of the score matrix. This is identical to manually projecting \(\bs{X}\) onto the selected columns of \(\bs{V}\) (the principal components), as it can be proved that \(t_{ij}=\bs{x}_i'\bs{v}_j\). However, by using SVD we avoid miscalculating the eigenvalues on low-precision systems.
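
A minimal NumPy sketch of sample PCA via the SVD of the mean-centered data matrix (the toy data below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 3
X = rng.standard_normal((n, p)) @ np.array([[3., 0., 0.],
                                            [1., 1., 0.],
                                            [0., 0., .1]])   # correlated toy data

Xc = X - X.mean(axis=0)                    # mean-centered sample matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

V = Vt.T                                   # loading matrix: k-th sample PC is V[:, k]
T = U @ np.diag(s)                         # score matrix T = U Sigma
eigvals = s ** 2 / (n - 1)                 # eigenvalues of the sample covariance S

# Same eigenvalues as an EVD of S, but without ever forming S explicitly
S = np.cov(Xc, rowvar=False)
print(np.allclose(np.sort(np.linalg.eigvalsh(S))[::-1], eigvals))   # True

scatter_xy = T[:, :2]                      # coordinates of a PCA scatter plot on the first two PCs
```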

Definition of Variable PCA

This is merely PCA performed on \(\bs{X}'\in\R^{p\times n}\): transposing \(\bs{X}\) swaps the roles of the variables and the observations. The SVD now becomes

\[ \bs{X}' = \bs{V\Sigma U}' \]

where we now instead call \(\bs{V\Sigma}=:\bs{T}\) the score matrix.

Remarks: By plotting \(\bs{X}\) against \(\bs{V}\) we get the PCA scatter plot; by also plotting \(\bs{X}\) against \(\bs{U}\) on the same figure as this PCA scatter plot, we get the so-called biplot.

Application of PCA: Sample Factor Analysis (FA)

One sentence to summarize it: sample factor analysis equals PCA. In formulas, FA tries to write \(\bs{X}\in\R^p\) as

\[ \bs{X} = \bs{\mu} + \bs{LF} + \bs{\varepsilon} \]

where \(\bs{\mu}\in\R^p\) are the \(p\) feature means (aka. alphas), \(\bs{L}\in\R^{p\times m}\) are the loadings (aka. betas) and \(\bs{F}\in\R^m\) are called the factors.

There are some assumptions:

With these assumptions we have \(\bs{\Sigma}=\bs{LL}'+\bs{\Xi}\).
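
A minimal sketch of one common way to operationalize this, the principal-component method: estimate \(\bs{L}\) from the top-\(m\) eigenpairs of the sample covariance matrix and put the residual diagonal into \(\bs{\Xi}\). The number of factors `m` and the toy covariance below are assumptions of mine:

```python
import numpy as np

def fa_by_pca(S, m):
    """Principal-component estimate of the factor model S ~ L L' + Xi."""
    lam, Q = np.linalg.eigh(S)
    idx = np.argsort(lam)[::-1][:m]            # top-m eigenpairs
    L = Q[:, idx] * np.sqrt(lam[idx])          # loadings: q_i scaled by sqrt(lambda_i)
    Xi = np.diag(np.diag(S - L @ L.T))         # specific variances on the diagonal
    return L, Xi

rng = np.random.default_rng(9)
B = rng.standard_normal((5, 5))
S = B @ B.T + np.eye(5)                        # a toy covariance matrix
L, Xi = fa_by_pca(S, m=2)
print(L.shape)                                              # (5, 2)
print(np.linalg.norm(S - (L @ L.T + Xi), 'fro'))            # residual of the LL' + Xi fit
```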

Canonical Correlation Analysis (CCA)

Similar to PCA, we try to introduce CCA in two ways, namely w.r.t. a population and a sample.

Notations for Population CCA

Assume \(p\le q\) (important!!!). Given \(\bs{X}\in\R^p\) and \(\bs{Y}\in\R^q\), define

\[ \mu_{\bs{X}} = \E[\bs{X}]\in\R^p\quad\text{and}\quad \mu_{\bs{Y}} = \E[\bs{Y}]\in\R^q \]

and

\[ \bs{\Sigma}_{\bs{X}} = \Cov[\bs{X}]\in\R^{p\times p}\quad\text{and}\quad \bs{\Sigma}_{\bs{Y}} = \Cov[\bs{Y}]\in\R^{q\times q}. \]

Furthermore, define

\[ \bs{\Sigma}_{\bs{XY}} = \Cov[\bs{X},\bs{Y}]=\E[(\bs{X}-\mu_{\bs{X}})(\bs{Y}-\mu_{\bs{Y}})']\in\R^{p\times q} \]

and

\[ \bs{\Sigma}_{\bs{YX}} = \Cov[\bs{Y},\bs{X}]=\E[(\bs{Y}-\mu_{\bs{Y}})(\bs{X}-\mu_{\bs{X}})']\in\R^{q\times p}. \]

Also, let

\[ \bs{W} = \begin{bmatrix} \bs{X}\\\bs{Y} \end{bmatrix} \in \R^{p+q}. \]

Then with this notation, given \(\bs{a}\in\R^p\) and \(\bs{b}\in\R^q\), we know how the expectation, variance and covariance (and thus correlation) of \(U=\bs{a}'\bs{X}\) and \(V=\bs{b}'\bs{Y}\) are represented.

Definition of Population CCA

We calculate canonical correlation variables iteratively:

Theorem Suppose \(p\le q\), let \(\bs{\Gamma}_{\bs{XY}}=\bs{\Sigma}_{\bs{X}}^{-1/2}\bs{\Sigma}_{\bs{XY}}\bs{\Sigma}_{\bs{Y}}^{-1/2}\in\R^{p\times q}\) and let its condensed SVD be¹

\[ \bs{\Gamma}_{\bs{XY}} = \begin{bmatrix} \bs{u}_1 & \bs{u}_2 & \cdots & \bs{u}_p \end{bmatrix}\begin{bmatrix} \sigma_1 & \cdots & \cdots & \bs{O} \\ \vdots & \sigma_2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \bs{O} & \cdots & \cdots & \sigma_p \end{bmatrix}\begin{bmatrix} \bs{v}_1'\\ \bs{v}_2' \\ \vdots \\ \bs{v}_p' \end{bmatrix} \]

which gives

\[ U_k = \bs{u}_k'\bs{\Sigma}_{\bs{X}}^{-1/2}\bs{X}\quad\text{and}\quad V_k = \bs{v}_k'\bs{\Sigma}_{\bs{Y}}^{-1/2}\bs{Y} \]

and that \(\rho_k=\sigma_k\).

We call \(\rho_k=\Corr[U_k,V_k]\) the \(k\)-th population canonical correlation, and we call \(\bs{a}_k=\bs{\Sigma}_{\bs{X}}^{-1/2}\bs{u}_k\) and \(\bs{b}_k=\bs{\Sigma}_{\bs{Y}}^{-1/2}\bs{v}_k\) the population canonical vectors.

Properties of Population CCA

The canonical correlation variables have the following three basic properties:

Theorem Let \(\tilde{\bs{X}}=\bs{MX}+\bs{c}\) be an affine transformation of \(\bs{X}\) with nonsingular \(\bs{M}\), and similarly let \(\tilde{\bs{Y}}=\bs{NY}+\bs{d}\) with nonsingular \(\bs{N}\). Then the results of CCA on \((\tilde{\bs{X}},\tilde{\bs{Y}})\) remain the same as those on \((\bs{X},\bs{Y})\), namely CCA is affine invariant.

Based on this theorem we have the following properties:

Example of Calculating Population CCA

Let’s assume \(p=q=2\), let

\[ \bs{P}_{\bs{X}} = \begin{bmatrix} 1 & \alpha \\ \alpha & 1 \end{bmatrix},\quad \bs{P}_{\bs{Y}} = \begin{bmatrix} 1 & \gamma\\ \gamma & 1 \end{bmatrix}\quad\text{and}\quad \bs{P}_{\bs{XY}} = \begin{bmatrix} \beta & \beta \\ \beta & \beta \end{bmatrix} \]

with \(|\alpha| < 1\), \(|\gamma| < 1\). In order to find the 1st canonical correlation of \(\bs{X}\) and \(\bs{Y}\), we first check

\[ \det(\bs{P}_{\bs{X}}) = 1 - \alpha^2 > 0\quad\text{and}\quad \det(\bs{P}_{\bs{Y}}) = 1 - \gamma^2 > 0. \]

Therefore, we may calculate

\[ \bs{H}_{\bs{XY}} = \bs{P}_{\bs{X}}^{-1}\bs{P}_{\bs{XY}}\bs{P}_{\bs{Y}}^{-1}\bs{P}_{\bs{YX}} = \frac{\beta^2}{(1-\alpha^2)(1-\gamma^2)}\begin{bmatrix} 1 & -\alpha \\ -\alpha & 1 \end{bmatrix}\bs{1}\bs{1}'\begin{bmatrix} 1 & -\gamma \\ -\gamma & 1 \end{bmatrix}\bs{1}\bs{1}' = \frac{2\beta^2}{(1+\alpha)(1+\gamma)}\bs{1}\bs{1}'. \]

It’s easy to show that \(\lambda_1= 2\) and \(\lambda_2=0\) are the two eigenvalues of \(\bs{11}'\) and thus the 1st canonical correlation is

\[ \rho_1 = \sqrt{\frac{4\beta^2}{(1+\alpha)(1+\gamma)}} = \frac{2\beta}{\sqrt{(1+\alpha)(1+\gamma)}}. \]

Notations for Sample CCA

Given \(n\) samples: \(\bs{X}=\{\bs{X}(\omega_1), \bs{X}(\omega_2), \ldots, \bs{X}(\omega_n)\}\in\R^{n\times p}\) and similarly \(\bs{Y}\in\R^{n\times q}\). Everything is just the same except being in sample notations, e.g. \(\bar{\bs{X}}=\bs{X}'\bs{1} / n\) and \(\bs{S}_{\bs{X}}=(\bs{X}-\bs{1}\bar{\bs{X}}')'(\bs{X}-\bs{1}\bar{\bs{X}}') / (n-1)\).

Besides the regular notations, here we also define \(r_{\bs{XY}}(\bs{a},\bs{b})\) as the sample correlation of \(\bs{a}'\bs{X}\) and \(\bs{b}'\bs{Y}\):

\[ r_{\bs{XY}}(\bs{a},\bs{b}) = \frac{\bs{a}'\bs{S}_{\bs{XY}}\bs{b}}{\sqrt{\bs{a}'\bs{S}_{\bs{X}}\bs{a}}\sqrt{\bs{b}'\bs{S}_{\bs{Y}}\bs{b}}}. \]

Definition of Sample CCA

Same as population CCA, we give sample CCA iteratively (except that here we’re talking about canonical correlation vectors directly):

Theorem Given \(p\le q\), let \(\bs{G}_{\bs{XY}} = \bs{S}_{\bs{X}}^{-1/2}\bs{S}_{\bs{XY}}\bs{S}_{\bs{Y}}^{-1/2}\in\R^{p\times q} \) and the SVD be

\[ \bs{G}_{\bs{XY}} = \begin{bmatrix} \bs{u}_1 & \bs{u}_2 & \cdots & \bs{u}_p \end{bmatrix}\begin{bmatrix} \sigma_1 & \cdots & \cdots & \bs{O} \\ \vdots & \sigma_2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \bs{O} & \cdots & \cdots & \sigma_p \end{bmatrix}\begin{bmatrix} \bs{v}_1' \\ \bs{v}_2' \\ \vdots \\ \bs{v}_p' \end{bmatrix}. \]

Then we have

\[ \hat{\bs{a}}_k = \bs{S}_{\bs{X}}^{-1/2} \bs{u}_k\in\R^p\quad \text{and}\quad \hat{\bs{b}}_k = \bs{S}_{\bs{Y}}^{-1/2} \bs{v}_k\in\R^q. \]

In addition,

\[ r_k = r_{\bs{XY}}(\hat{\bs{a}}_k, \hat{\bs{b}}_k) = \sigma_k. \]

We call \(r_k\) the \(k\)-th sample canonical correlation. Also, \(\bs{X}\hat{\bs{a}}_k\) and \(\bs{Y}\hat{\bs{b}}_k\) are called the score vectors.
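
A sketch of sample CCA in NumPy following the theorem above (assuming \(p\le q\) and nonsingular \(\bs{S}_{\bs{X}}\), \(\bs{S}_{\bs{Y}}\); the toy data are constructed so that the canonical correlations come out high):

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of an SPD matrix via its EVD."""
    lam, Q = np.linalg.eigh(S)
    return Q @ np.diag(lam ** -0.5) @ Q.T

def sample_cca(X, Y):
    """Sample CCA via the SVD of S_X^{-1/2} S_XY S_Y^{-1/2}."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    SX, SY = Xc.T @ Xc / (n - 1), Yc.T @ Yc / (n - 1)
    SXY = Xc.T @ Yc / (n - 1)
    G = inv_sqrt(SX) @ SXY @ inv_sqrt(SY)
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    A = inv_sqrt(SX) @ U                   # columns are the canonical vectors a_k
    B = inv_sqrt(SY) @ Vt.T                # columns are the canonical vectors b_k
    return s, A, B                         # s holds the sample canonical correlations

rng = np.random.default_rng(10)
Z = rng.standard_normal((200, 2))                       # shared latent variables
X = Z + 0.1 * rng.standard_normal((200, 2))
Y = np.hstack([Z, rng.standard_normal((200, 1))])
r, A, B = sample_cca(X, Y)
print(np.round(r, 3))                                   # both close to 1 for this toy data
```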

Properties of Sample CCA

Everything with population CCA, including the affine invariance, holds with sample CCA, too.

Example of Calculating Sample CCA

Let \(p=q=2\), let

\[ \bs{X}=\begin{bmatrix} \text{head length of son1}\\ \text{head breadth of son1} \end{bmatrix}=\begin{bmatrix} l_1\\b_1 \end{bmatrix}\quad\text{and}\quad \bs{Y}=\begin{bmatrix} \text{head length of son2}\\ \text{head breadth of son2} \end{bmatrix}=\begin{bmatrix} l_2\\b_2 \end{bmatrix}. \]

Assume there’re \(25\) families, each having \(2\) sons. The \(\bs{W}\) matrix is therefore

\[ \bs{W} = \begin{bmatrix} \bs{l}_1 & \bs{b}_1 & \bs{l}_2 & \bs{b}_2 \end{bmatrix}\in\R^{25\times 4}. \]

In addition, we may also calculate the correlations \(\bs{R}_{\bs{X}}\), \(\bs{R}_{\bs{Y}}\) and \(\bs{R}_{\bs{XY}}\), from which we have \(\bs{G}_{\bs{XY}}\) given by

\[ \bs{G}_{\bs{XY}} = \bs{R}_{\bs{X}}^{-1/2}\bs{R}_{\bs{XY}}\bs{R}_{\bs{Y}}^{-1/2} \]

which gives the sample canonical correlations \(r_1\) and \(r_2\) (in most cases \(r_1\gg r_2\)). In the meantime, we have \(\bs{u}_{1,2}\) and \(\bs{v}_{1,2}\), and because of the significant difference in scale between \(r_1\) and \(r_2\), we don’t care about \(\bs{u}_2\) and \(\bs{v}_2\). From \(\bs{u}_1\) and \(\bs{v}_1\) we can calculate \(\hat{\bs{a}}_1\) and \(\hat{\bs{b}}_1\), which indicate the linear relationship between the features. Specifically, the high correlation \(r_1\) tells us that \(\hat{U}_1=\hat{\bs{a}}_1'\bs{X}\) and \(\hat{V}_1=\hat{\bs{b}}_1'\bs{Y}\), which are essentially the “girths” (length \(+\) breadth) of the sons’ faces, are highly correlated. In comparison, the data show that \(\hat{U}_2\) and \(\hat{V}_2\), which describe the “shapes” (length \(-\) breadth) of the sons’ faces, are poorly correlated.

Linear Discriminant Analysis (LDA)

Different from the unsupervised learning algorithms we’ve been discussing in the previous sections, like PCA, FA and CCA, LDA is a supervised classification method, in that the classes and the class membership of the training samples are specified explicitly.

Notations for LDA

Assume \(g\) different classes \(C_1,C_2,\ldots,C_g\). We try to solve the problem asking for the optimal division of \(n\) rows of \(\bs{X}\in\R^{n\times p}\) into \(g\) parts: for each \(i=1,2,\ldots,g\) denote

\[ \bs{X}_i = \begin{bmatrix} \bs{x}_{i,1}'\\ \bs{x}_{i,2}'\\ \vdots\\ \bs{x}_{i,n_i}' \end{bmatrix}\in\R^{n_i\times p} \]

with \(\sum_{i=1}^g n_i=n\), then given \(\bs{a}\in\R^p\) we can define

\[ \bs{X}_i\bs{a} =: \bs{y}_i \in \R^{n_i},\quad i=1,2,\ldots,g. \]

Now, recall the mean-centering matrix

\[ \bs{H} = \bs{I} - \frac{\bs{\iota}\bs{\iota}'}{\bs{\iota}'\bs{\iota}} \]

which provides a handy feature that centers each column by its mean: for any \(\bs{X}\)

\[ \bs{HX} = \bs{X} - \bs{\iota}\bar{\bs{X}}'. \]

With \(\bs{H}\) we also have the total sum-of-squares as

\[ \bs{T} = \bs{X}'\bs{HX} = \bs{X}'\left(\bs{I}-\frac{\bs{\iota}\bs{\iota}'}{\bs{\iota}'\bs{\iota}}\right)\bs{X} = (n-1)\bs{S}. \]

Similarly we define \(\bs{H}_i\) and \(\bs{W}_i=\bs{X}_i'\bs{H}_i\bs{X}_i\) for each \(i=1,2,\ldots,g\) and thus

\[ \sum_{i=1}^g \bs{y}_i'\bs{H}_i\bs{y}_i = \sum_{i=1}^g \bs{a}'\bs{X}_i'\bs{H}_i\bs{X}_i\bs{a} = \bs{a}' \left(\sum_{i=1}^g \bs{X}_i'\bs{H}_i\bs{X}_i\right) \bs{a} = \bs{a}'\sum_{i=1}^g \bs{W}_i\bs{a} =:\bs{a}'\bs{W}\bs{a} \]

where \(\bs{W}\in\R^{p\times p}\) is known as the within-group sum-of-squares. Finally, check

\[ \begin{align} \sum_{i=1}^g n_i (\bar{\bs{y}}_i-\bar{\bs{y}})^2 &= \sum_{i=1}^g n_i (\bar{\bs{X}}_i\bs{a}-\bar{\bs{X}}\bs{a})^2 = \sum_{i=1}^g n_i (\bar{\bs{X}}_i\bs{a}-\bar{\bs{X}}\bs{a})'(\bar{\bs{X}}_i\bs{a}-\bar{\bs{X}}\bs{a}) \\&= \bs{a}' \left[\sum_{i=1}^g n_i (\bar{\bs{X}}_i - \bar{\bs{X}})'(\bar{\bs{X}}_i - \bar{\bs{X}})\right] \bs{a} =: \bs{a}'\bs{B}\bs{a} \end{align} \]

where we define \(\bs{B}\in\R^{p\times p}\) as the between-group sum-of-squares.

Theorem For any \(\bs{a}\in\R^p\), \(\bs{a}'\bs{T}\bs{a} = \bs{a}'\bs{B}\bs{a} + \bs{a}'\bs{W}\bs{a}\); equivalently, \(\bs{T} = \bs{B}+\bs{W}\).

Definition of LDA

First let’s give the Fisher Linear Discriminant Function

\[ f(\bs{a}) = \frac{\bs{a}'\bs{B}\bs{a}}{\bs{a}'\bs{W}\bs{a}} \]

maximizing which, according to Fisher, gives the optimal classification.

Theorem Suppose \(\bs{W}\in\R^{p\times p}\) is nonsingular, let \(\bs{q}_1\in\R^p\) be the principal eigenvector of \(\bs{W}^{-1}\bs{B}\) corresponding to \(\lambda_1=\lambda_{\max}(\bs{W}^{-1}\bs{B})\), then it can be shown that \(\bs{q}_1=\arg\max_{\bs{a}}f(\bs{a})\) and \(\lambda_1=\max_{\bs{a}}f(\bs{a})\).

Proof. We know

\[ \max_{\bs{a}} \frac{\bs{a}'\bs{B}\bs{a}}{\bs{a}'\bs{W}\bs{a}} = \max_{\bs{a}'\bs{W}\bs{a}=1} \bs{a}'\bs{B}\bs{a} = \max_{\bs{b}'\bs{b}=1} \bs{b}'\bs{W}^{-1/2}\bs{B}\bs{W}^{-1/2}\bs{b} = \lambda_{\max}(\bs{W}^{-1/2}\bs{B}\bs{W}^{-1/2}) = \lambda_{\max}(\bs{W}^{-1}\bs{B}). \]

Therefore, we have the maximum \(\lambda_1\) and thereby we find the maximizer \(\bs{q}_1\). Q.E.D.

Remark: \(\lambda_1\) and \(\bs{q}_1\) can also be seen as the solutions to

\[ \bs{B}\bs{x} = \lambda \bs{W}\bs{x},\quad \bs{x}\neq\bs{0} \]

which is known as a generalized eigenvalue problem since it reduces to the usual case when \(\bs{W}=\bs{I}\).

Classification Rule of LDA

Given a new data point \(\bs{t}\in\R^p\) we find

\[ i = \underset{j=1,2,\ldots,g}{\arg\min} |\bs{q}_1'(\bs{t}-\bar{\bs{X}}_j)| \]

and assign \(\bs{t}\) to class \(C_i\). This is essentially the heuristic behind Fisher’s LDA.
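
A sketch of the whole procedure in NumPy: build \(\bs{W}\) and \(\bs{B}\) from per-class data matrices, take the principal eigenvector of \(\bs{W}^{-1}\bs{B}\), and classify by the nearest projected group mean (the two Gaussian blobs are toy data):

```python
import numpy as np

def fisher_lda(groups):
    """Fisher discriminant direction q1 from a list of per-class data matrices X_i (n_i x p)."""
    Xall = np.vstack(groups)
    xbar = Xall.mean(0)
    p = Xall.shape[1]
    W, B = np.zeros((p, p)), np.zeros((p, p))
    for Xi in groups:
        Xic = Xi - Xi.mean(0)
        W += Xic.T @ Xic                              # within-group sum-of-squares
        d = (Xi.mean(0) - xbar)[:, None]
        B += Xi.shape[0] * (d @ d.T)                  # between-group sum-of-squares
    lam, V = np.linalg.eig(np.linalg.solve(W, B))     # eigenpairs of W^{-1} B
    return np.real(V[:, np.argmax(np.real(lam))])     # principal eigenvector q1

def classify(t, q1, group_means):
    """Assign t to the class whose projected mean is closest to q1't."""
    return int(np.argmin([abs(q1 @ (t - m)) for m in group_means]))

rng = np.random.default_rng(11)
X1 = rng.standard_normal((30, 2))                       # class C_1 around (0, 0)
X2 = rng.standard_normal((30, 2)) + np.array([4., 4.])  # class C_2 around (4, 4)
q1 = fisher_lda([X1, X2])
means = [X1.mean(0), X2.mean(0)]
print(classify(np.array([3.5, 4.2]), q1, means))        # 1, i.e. class C_2
```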

There is no general formula for \(g\)-class LDA, but we do have one specifically for \(g=2\).

Population LDA for Binary Classification

The population linear discriminant function is

\[ f(\bs{a}) = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{within}}} = \frac{(\bs{a}'\bs{\mu}_1 - \bs{a}'\bs{\mu}_2)^2}{\bs{a}'\bs{\Sigma}_1\bs{a} + \bs{a}'\bs{\Sigma}_2\bs{a}} = \frac{[\bs{a}'(\bs{\mu}_1-\bs{\mu}_2)]^2}{\bs{a}'(\bs{\Sigma}_1+\bs{\Sigma}_2)\bs{a}} \]

which solves to

\[ \bs{q}_1 = (\bs{\Sigma}_1 + \bs{\Sigma}_2)^{-1}(\bs{\mu}_1-\bs{\mu}_2) := \bs{\Sigma}^{-1}(\bs{\mu}_1-\bs{\mu}_2) \]

and the threshold

\[ c=\frac{1}{2}(\bs{\mu}_1'\bs{\Sigma}^{-1}\bs{\mu}_1 - \bs{\mu}_2'\bs{\Sigma}^{-1}\bs{\mu}_2). \]

The classification is therefore given by

\[ C(\bs{t}) = \begin{cases} C_1\ & \text{if}\quad\bs{q}_1'\bs{t} > c,\\ C_2\ & \text{if}\quad\bs{q}_1'\bs{t} < c.\\ \end{cases} \]
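
A tiny NumPy sketch of this closed-form rule (the means and covariances below are arbitrary toy values):

```python
import numpy as np

def binary_lda(mu1, mu2, Sigma1, Sigma2):
    """Closed-form binary LDA: direction q1 and threshold c as given above."""
    Sigma = Sigma1 + Sigma2
    q1 = np.linalg.solve(Sigma, mu1 - mu2)
    c = 0.5 * (mu1 @ np.linalg.solve(Sigma, mu1) - mu2 @ np.linalg.solve(Sigma, mu2))
    return q1, c

mu1, mu2 = np.array([1., 1.]), np.array([-1., -1.])
Sigma1 = Sigma2 = np.eye(2)
q1, c = binary_lda(mu1, mu2, Sigma1, Sigma2)
t = np.array([0.8, 0.5])
print('C1' if q1 @ t > c else 'C2')                   # C1: t falls on the mu1 side of the threshold
```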

MDS and CA

Various matrix decomposition methods provide different classes of data analysis tools:


  1. Every symmetric positive definite matrix has a square root given by \(\bs{A}^{1/2} = \bs{Q}\bs{\Lambda}^{1/2}\bs{Q}'\) where \(\bs{A}=\bs{Q\Lambda Q}'\) is the EVD of \(\bs{A}\). ↩︎
  2. In fact the canonical correlation vectors are scaled by \(\bs{V}^{-1/2}_{\bs{X}}\) and \(\bs{V}^{-1/2}_{\bs{Y}}\) respectively. ↩︎