# Notes on Multivariate Data Analysis via Matrix Decomposition

### 2019-09-30

These are the notes taken on my master course Multivariate Data Analysis via Matrix Decomposition. If you’re confused with the course name, you can think of this as a statistical course on unsupervised learning. $$\newcommand{1}{\unicode{x1D7D9}_{\{#1\}}} \newcommand{Corr}{\text{Corr}} \newcommand{E}{\text{E}} \newcommand{Cov}{\text{Cov}} \newcommand{Var}{\text{Var}} \newcommand{span}{\text{span}} \newcommand{bs}{\boldsymbol} \newcommand{R}{\mathbb{R}} \newcommand{rank}{\text{rank}} \newcommand{\norm}{\left\lVert#1\right\rVert} \newcommand{diag}{\text{diag}} \newcommand{tr}{\text{tr}} \newcommand{braket}{\left\langle#1\right\rangle} \newcommand{C}{\mathbb{C}}$$

# Notations

First let’s give some standard notations used in this course. Let’s assume no prior knowledge in linear algebra and start from matrix multiplication.

## Matrix Multiplication

We denote a matrix $$\bs{A}\in\R^{m\times n}$$, with its entries defined as $$[a_{ij}]_{i,j=1}^{m,n}$$. Similarly, we define $$\bs{B}=[b_{jk}]_{j,k=1}^{n,p}$$ and thus the multiplication is defined as $$\bs{AB} = [\sum_{j=1}^n a_{ij}b_{jk}]_{i,k=1}^{n,p}$$, which can also be represented in three other ways:

• vector form, using $$\bs{a}$$ and $$\bs{b}$$
• a matrix of products of $$A$$ and $$\bs{b}$$
• a matrix of products of $$\bs{a}$$ and $$\bs{B}$$

A special example of such representation: let’s assume

$\bs{A}=[\bs{a}_1,\bs{a}_2,\ldots,\bs{a}_n]\in\R^{m\times n}\text{ and } \bs{D} = \diag(d_1,d_2,\ldots,d_n) \in\R^{n\times n},$

then we have right away $$\bs{AD}=[\bs{a}_id_i]_{i=1}^n$$.

Exercise With multiplication we care ranks of matrices. There is a quick conclusion: If $$\bs{x}\neq \bs{0}, \bs{y}\neq \bs{0}$$, then $$\rank(\bs{yx'})=1$$. Conversely, if $$\rank(\bs{A})=1$$, then $$\exists\ \bs{x}\neq \bs{0}, \bs{y}\neq \bs{0}$$ s.t. $$\bs{xy'}=\bs{A}$$. Prove it.

## Norms

There are two types of norms in this course we consider:

• (Euclidean) We define the $$l^1$$-norm as $$\norm{x}_1 = \sum_{i=1}^n |x_i|$$, define $$l^2$$-norm as $$\norm{x}_2 = \sqrt{\bs{x'x}}$$, define $$l^{\infty}$$-norm as $$\norm{x}_{\infty} = \max_{1\le i \le n}\{|x_i|\}$$, and define the Mahalanobis norm as $$\norm{x}_A = \sqrt{\bs{x'Ax}}$$.
• (Frobenius) We define the Frobenius norm of a matrix as $$\norm{\bs{A}}_F=\sqrt{\sum_{i=1}^m\sum_{j=1}^n a_{ij}^2}$$. The spectral 2-norm of a matrix is defined as $$\norm{\bs{A}}_2=\max_{\bs{x}\neq \bs{0}} \norm{\bs{Ax}}_2 / \norm{\bs{x}}_2$$.

Properties of these norms:

• $$\norm{\bs{v}}=0$$ iff. $$\bs{v}=\bs{0}$$.
• $$\norm{\alpha \bs{v}} = |\alpha|\cdot\norm{\bs{v}}$$ for any $$\alpha\in\R$$ and any $$\bs{v}\in\mathcal{V}$$.
• (Triangular Inequality) $$\norm{\bs{u} + \bs{v}} \le \norm{\bs{u}} + \norm{\bs{v}}$$ for any $$\bs{u}, \bs{v}\in\mathcal{V}$$.
• (Submultiplicative) $$\norm{\bs{AB}}\le \norm{\bs{A}}\cdot \norm{\bs{B}}$$ for every formable matrices $$\bs{A}$$ and $$\bs{B}$$.

Exercise Try to prove them for Euclidean 2-norm, Frobenius norm and spectral 2-norm.

## Inner Products

There are two types of inner products we consider:

• (Euclidean) We define the inner product of vectors $$\bs{x},\bs{y}\in\R^n$$ as $$\bs{x'y}=\sum_{i=1}^n x_iy_i$$.
• (Frobenius) We define the inner product of matrices $$\bs{A},\bs{B}\in\R^{m\times n}$$ as $$\braket{\bs{A},\bs{B}}=\tr(\bs{A'B})=\sum_{i=1}^m\sum_{j=1}^n a_{ij}b_{ij}$$.

A famous inequality related to these inner products is the Cauchy-Schwarz inequality, which states

• (Euclidean) $$|\bs{x'y}|\le \norm{\bs{x}}_2\cdot\norm{\bs{y}}_2$$ for any $$\bs{x,y}\in\R^n$$.
• (Frobenius) $$|\braket{\bs{A},\bs{B}}|\le\norm{\bs{A}}_F\cdot\norm{\bs{B}}_F$$ for any $$\bs{A},\bs{B}\in\R^{m\times n}$$.

# Eigenvalue Decomposition (EVD)

The first matrix decomposition we’re gonna talk about is the eigenvalue decomposition.

## Eigenvalues and Eigenvectors

For square matrix $$\bs{A}\in\R^{n\times n}$$, if $$\bs{0}\neq \bs{x}\in\C^n$$ and $$\lambda\in\C$$ is s.t. $$\bs{Ax} = \lambda\bs{x}$$, then $$\lambda$$ is called an engenvalue of $$\bs{A}$$ and $$\bs{x}$$ is called the $$\lambda$$-engenvector of $$\bs{A}$$.

Ideally, we want a matrix to have $$n$$ eigenvectors and $$n$$ corresponding eigenvectors, linearly independent to each other. This is not always true.

## Existence of EVD

Theorem $$\bs{A}\in\R^{n\times n}$$ have $$n$$ eigenvalues iff. there exists an invertible $$\bs{X}\in\R^{n\times n}$$ s.t. $$\bs{X}^{-1}\bs{A}\bs{X}=\bs{\Lambda}$$, i.e. $$\bs{A}$$ is diagonizable. This gives $$\bs{A}=\bs{X}\bs{\Lambda}\bs{X}^{-1}$$, which is called the eigenvalue decomposition (EVD).

Theorem (Spectral Theorem for Symmetric Matrices) For symmetric matrix $$\bs{A}\in\R^{n\times n}$$ there always exists an orthogonal matrix $$\bs{Q}$$, namely $$\bs{Q}'\bs{Q}=\bs{I}$$, that gives

$\bs{A}=\bs{Q\Lambda Q}' = \sum_{i=1}^n \lambda_i \bs{q}_i \bs{q}_i'$

where $$\bs{q}$$ are column vectors of $$\bs{Q}$$. This is called the symmetric EVD, aka. $$\bs{A}$$ being orthogonally diagonalizable.

## Properties of EVD

We have several properties following the second theorem above. For all $$i=1,2,\ldots, n$$

• $$\bs{A}\bs{q}_i = \lambda_i \bs{q}_i$$ (can be proved using $$\bs{Q}^{-1}=\bs{Q}'$$)
• $$\norm{\bs{q}_i}_2=1$$ (can be proved using $$\bs{QQ}'=\bs{I}$$)

The second theorem above can also be represented as

Theorem If $$\bs{A}=\bs{A}'$$, then $$\bs{A}$$ has $$n$$ orthogonal eigenvectors.

# Singular Value Decomposition (SVD)

For general matrices, we have singular value decomposition.

## Definition

The most famous form of SVD is define as

$\bs{A} = \bs{U} \bs{\Sigma} \bs{V}'$

where $$\bs{A}\in\R^{m\times n}$$, $$\bs{U}\in\R^{m\times m}$$, $$\bs{\Sigma}\in\R^{m\times n}$$ and $$\bs{V}\in\R^{n\times n}$$. Specifically, both $$\bs{U}$$ and $$\bs{V}$$ are orthogonal (i.e. $$\bs{U}'\bs{U}=\bs{I}$$, same for $$\bs{V}$$) and $$\bs{\Sigma}$$ is diagonal. Usually, we choose the singular values to be non-decreasing, namely

$\bs{\Sigma}=\diag(\sigma_1,\sigma_2,\ldots,\sigma_{\min\{m,n\}})\quad\text{where}\quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{m,n\}}.$

## Terminology

Here we define a list of terms that’ll be used from time to time:

• (SVD) $$\bs{A} = \bs{U} \bs{\Sigma} \bs{V}'$$.
• (Left Singular Vectors) Columns of $$\bs{U}$$.
• (Right Singular Vectors) Columns of $$\bs{V}$$.
• (Singular Values) Diagonal entries of $$\bs{\Sigma}$$.

## Three Forms of SVD

Besides the regular SVD given above, we have the outer product SVD:

$\bs{A} = \sum_{i=1}^{\min\{m,n\}}\!\!\!\sigma_i \bs{u}_i \bs{v}_i'$

and condensed SVD:

$\bs{A} = \bs{U}_r\bs{\Sigma}_r\bs{V}_r'$

where $$r=\rank(\bs{A})$$ is also the number of non-zero singular values. In this form, we have $$\bs{\Sigma}_r\in\R^{r\times r}$$ with proper chunked $$\bs{U}_r$$ and $$\bs{V}_r$$.

## Existence of SVD

Theorem (Existence of SVD) Let $$\bs{A}\in\R^{m\times n}$$ and $$r=\rank(\bs{A})$$. Then $$\exists\ \bs{U}_r\in\R^{m\times r}$$, $$\bs{V}_r\in\R^{n\times r}$$ and $$\bs{\Sigma}_r\in\R^{r\times r}$$ s.t. $$\bs{A} = \bs{U}_r\bs{\Sigma}_r\bs{V}_r'$$ where $$\bs{U}_r$$ and $$\bs{V}_r$$ are orthogonal and $$\bs{\Sigma}_r$$ is diagonal. This means condensed SVD exists and therefore the rest two forms.

Proof. Define symmetric $$\bs{W}\in\R^{(m+n)\times(m+n)}$$ as

$\bs{W} = \begin{bmatrix} \bs{0} & \bs{A} \\ \bs{A}' & \bs{0} \end{bmatrix}$

which has an orthogonal EVD as $$\bs{W} = \bs{Z}\bs{\Lambda}\bs{Z}'$$ where $$\bs{Z}'\bs{Z}=\bs{I}$$. Now, assume $$\bs{z}\in\R^{m+n}$$ is an eigenvector of $$\bs{W}$$ corresponding to $$\lambda$$, then $$\bs{W}\bs{z} = \lambda \bs{z}$$. Denote the first $$m$$ entries of $$\bs{z}$$ as $$\bs{x}$$ and the rest $$\bs{y}$$, which gives

$\begin{bmatrix} \bs{0} & \bs{A}\\ \bs{A}' & \bs{0} \end{bmatrix} \begin{bmatrix} \bs{x} \\ \bs{y} \end{bmatrix} = \lambda \begin{bmatrix} \bs{x} \\ \bs{y} \end{bmatrix} \Rightarrow \begin{cases} \bs{Ay} = \lambda \bs{x},\\ \bs{A}'\bs{x} = \lambda \bs{y}. \end{cases}$

Using this results

$\begin{bmatrix} \bs{0} & \bs{A}\\ \bs{A}' & \bs{0} \end{bmatrix} \begin{bmatrix} \bs{x} \\ -\bs{y} \end{bmatrix} = \begin{bmatrix} -\bs{Ay} \\ \bs{A}'\bs{y} \end{bmatrix} = \begin{bmatrix} -\lambda \bs{x}\\ \lambda \bs{y} \end{bmatrix} = -\lambda\begin{bmatrix} \bs{x}\\ -\bs{y} \end{bmatrix}$

which means $$-\lambda$$ is also an engenvalue of $$\bs{W}$$. Hence, we know

\begin{align} \bs{W} &= \bs{Z}\bs{\Lambda}\bs{Z}' = \bs{Z}_r\bs{\Lambda}_r\bs{Z}_r'\\ &= \begin{bmatrix} \bs{X} & \bs{X}\\ \bs{Y} & -\bs{Y} \end{bmatrix} \begin{bmatrix} \bs{\Sigma} & \bs{0}\\ \bs{0} & -\bs{\Sigma} \end{bmatrix} \begin{bmatrix} \bs{X} & \bs{X}\\ \bs{Y} & -\bs{Y} \end{bmatrix}'\\ &= \begin{bmatrix} \bs{0} & \bs{X}\bs{\Sigma}\bs{Y}'\\ \bs{Y}\bs{\Sigma}\bs{X}' & \bs{0} \end{bmatrix}. \end{align}

Therefore, we conclude $$\bs{A}=\bs{X}\bs{\Sigma}\bs{Y}'$$ where now all we need to prove is the orthogonality of $$\bs{X}$$ and $$\bs{Y}$$. Let’s take a look at $$\bs{z}=(\bs{x},\bs{y})$$ we just defined. Let

$\norm{\bs{z}}=\bs{z}'\bs{z}=\bs{x}'\bs{x} + \bs{y}'\bs{y} = 2.$

From orthogonality of eigenvectors corresponding to different eigenvalues, we also know

$\bs{z}'\bar{\bs{z}} = \bs{x}'\bs{x} - \bs{y}'\bs{y} = 0$

which altogether gives $$\norm{\bs{x}}=\norm{\bs{y}}=1$$.Q.E.D.

## How to Calculate SVD

Three steps to calculate the SVD of $$\bs{A}\in\R^{m\times n}$$:

• $$\bs{\Sigma}$$: calculate eigenvalues of $$\bs{AA}'$$, then let $$\bs{\Sigma}=\diag\{\sigma_1,\sigma_2,\ldots,\sigma_r\}\in\R^{m\times n}$$.
• $$\bs{U}$$: calculate eigenvectors of $$\bs{AA}'$$, then normalize them to norm $$1$$, then $$\bs{U}=(\bs{u_1},\bs{u_2},\ldots,\bs{u}_m)$$.
• $$\bs{V}$$: calculate eigenvectors of $$\bs{A}'\bs{A}$$, then normalize them to norm $$1$$, then $$\bs{V}=(\bs{v}_1,\bs{v}_2,\ldots,\bs{v}_n)$$.

Remark: Alternatively you may use formula $$\bs{U}=\bs{AV\Sigma}^{-1}\Rightarrow \bs{u}_i=\bs{Av}_i/\sigma_i$$.

## Properties of SVD

There are several characteristics we have about SVD:

• The $$\bs{W}$$ decomposition above.
• The left singular vector $$\bs{v}$$ given by $$\bs{Au} = \sigma \bs{v}$$, and the right singular vector $$\bs{u}$$ given by $$\bs{A}'\bs{v} = \sigma \bs{u}$$.
• Relationship with eigenvectors/eigenvalues…
• of $$\bs{A}'\bs{A}$$: $$\bs{A}'\bs{A}\bs{u} = \sigma\bs{A}'\bs{v} = \sigma^2\bs{u}$$.
• of $$\bs{AA}'$$: $$\bs{AA}'\bs{v} = \sigma\bs{A}\bs{u} = \sigma^2\bs{v}$$.
• Frobenius norms (eigenvalues cannot define a norm!):
• $$\norm{\bs{A}}_F^2 = \sum_{i,j=1}^{m,n}a_{ij}^2=\sum_{i=1}^r\sigma_i^2$$.
• $$\norm{\bs{A}}_2 = \max_{\bs{x}\neq 0} \norm{\bs{Ax}}_2 / \norm{\bs{x}}_2 = \sigma_1$$.

Exercise Show how to use SVD to calculate these two norms.

## Applications of SVD

1) Projections. One of the most importance usages of SVD is computing projections. $$\bs{P}\in\R^{n\times n}$$ is a projection matrix iff. $$\bs{P}^2=\bs{P}$$. More commonly, we consider orthogonal projection $$\bs{P}$$ that’s also symmetric. Now let’s consider dataset $$\bs{A}\in\R^{m\times n}$$ where we have $$n$$ observations, each with $$m$$ dimensions. Suppose we want to project this dataset onto $$\bs{W}\subseteq\R^m$$ that has $$k$$ dimensions, i.e.

$\bs{W} = \span\{\bs{q}_1,\bs{q}_2,\ldots,\bs{q}_k\},\quad \bs{q}_i'\bs{q}_j=\1{i=j}$

then the projection matrix would be $$\bs{P}_{\bs{W}}=\bs{Q}_k\bs{Q}_k'$$.

Nearest Orthogonal Matrix. The nearest orthogonal matrix of $$\bs{A}\in\R^{p\times p}$$ is given by

$\min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}-\bs{X}}_F$

which solves if we have optima for

\begin{align} \min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}-\bs{X}}_F^2 &= \min_{\bs{X}'\bs{X}=\bs{I}}\tr[(\bs{A}-\bs{X})'(\bs{A}-\bs{X})]\\&= \min_{\bs{X}'\bs{X}=\bs{I}}\tr[\bs{A}'\bs{A} - \bs{X}'\bs{A} - \bs{A}'\bs{X} + \bs{X}'\bs{X}]\\&= \min_{\bs{X}'\bs{X}=\bs{I}} \norm{\bs{A}}_F^2 - \tr(\bs{A}'\bs{X}) - \tr(\bs{X}'\bs{A}) + \tr(\bs{X}'\bs{X})\\&= \norm{\bs{A}}_F^2 + n - 2\max_{\bs{X}'\bs{X}=\bs{I}} \tr(\bs{A}'\bs{X}) \end{align}.

Now we try to solve

$\max_{\bs{X}'\bs{X}=\bs{I}} \tr(\bs{A}'\bs{X})$

and claim the solution is given by $$\bs{X} = \bs{U}\bs{V}'$$ where $$\bs{U}$$ and $$\bs{V}$$ are derived from SVD of $$\bs{A}$$, namely $$\bs{A} = \bs{U\Sigma V}'$$. Proof: We know

$\tr(\bs{A}'\bs{X}) = \tr(\bs{V}\bs{\Sigma}'\bs{U}'X) = \tr(\bs{\Sigma}'\bs{U}'\bs{X}\bs{V}) =: \tr(\bs{\Sigma}'\bs{Z})$

where we define $$\bs{Z}$$ as the product of the three orthogonal matrices, which therefore is orthogonal: $$\bs{Z}'\bs{Z}=\bs{I}$$.

Orthogonality of $$\bs{Z}$$ gives $$\forall i$$

$z_{i1}^2 + z_{i2}^2 + \cdot + z_{ip}^2 = 1 \Rightarrow z_{ii} \ge 1.$

Hence, (note all singular values are non-negative)

$\tr(\bs{\Sigma}'\bs{Z}) = \sum_{i=1}^p \sigma_i z_{ii} \le \sum_{i=1}^p \sigma_i$

which gives optimal $$\bs{Z}^*=\bs{I}$$ and thus the solution follows.

2) Orthogonal Procrustes Problem. This seeks the solution to

$\min_{\bs{X}'\bs{X}=\bs{I}}\norm{\bs{A}-\bs{BX}}_F$

which is, similar to the problem above, given by the SVD of $$\bs{BA}'=\bs{U\Sigma V}'$$, namely $$\bs{X}=\bs{UV}'$$.

3) Nearest Symmetric Matrix for $$\bs{A}\in\R^{p\times p}$$ seeks solution to $$\min\norm{\bs{A}-\bs{X}}_F$$, which is simply

$\bs{X} = \frac{\bs{A}+\bs{A}'}{2}.$

In order to prove it, write $$\bs{A}$$ in the form of

$\bs{A} = \frac{\bs{A} + \bs{A}'}{2} + \frac{\bs{A} - \bs{A}'}{2} =: \bs{X} + \bs{Y}.$

Notice $$\tr(\bs{X}'\bs{Y})=0$$, hence by Pythagoras we know $$\bs{Y}$$ is the minimum we can find for the problem above.

4) Best Rank-$$r$$ Approximation. In order to find the best rank-$$r$$ approximation in Frobenius norm, we need solution to

$\min_{\rank(\bs{X})\le r} \norm{\bs{A}-\bs{X}}_F$

which is merely $$\bs{X}=\bs{U}_r\bs{\Sigma}_r\bs{V}_r'$$. See condensed SVD above for notation.

The best approximation in 2-norm, namely solution to

$\min_{\rank(\bs{X})\le r} \norm{\bs{A}-\bs{X}}_2,$

is exactly identical to the one above. We may prove both by reduction to absurdity. Proof: Suppose $$\exists \bs{B}\in\R^{n\times p}$$ s.t.

$\norm{\bs{A}-\bs{B}}_2 < \norm{\bs{A}-\bs{X}}_2 = \sigma_{r+1}.$

Now choose $$\bs{w}$$ from kernel of $$\bs{B}$$ and we have

$\bs{Aw}=\bs{Aw}+\bs{0} = (\bs{A}-\bs{B})\bs{w}$

and thus

$\norm{\bs{Aw}}_2 = \norm{(\bs{A}-\bs{B})\bs{w}}_2 \le \norm{\bs{A}-\bs{B}}_2\cdot \norm{\bs{w}}_2 <\sigma_{r+1}\norm{\bs{w}}_2\tag{1}.$

Meanwhile, note $$\bs{w}\in\span\{v_1,v_2,\ldots,v_{r+1}\}=\bs{W}$$, assume particularly $$\bs{w}=\bs{v}_{r+1}\bs{\alpha}$$, then

\begin{align} \norm{\bs{Aw}}_2^2&=\norm{\bs{U}\bs{\Sigma}\bs{V}'\bs{w}}_2^2 = \sum_{i=1}^{r+1}\sigma_i^2\alpha_i^2 \ge \sigma_{r+1}^2\sum_{i=1}^{r+1}\alpha_i^2\\ &= \sigma_{r+1}^2\norm{\bs{\alpha}}_2^2\equiv \sigma_{r+1}^2\norm{\bs{w}}_2^2.\tag{2} \end{align}

Due to contradiction between eq. (1) and (2) we conclude such $$\bs{B}$$ doesn’t exist.

## Orthonormal Bases for Four Subspaces using SVD

SVD can be used to get orthonormal bases for each of the four subspaces: the column space $$C(\bs{A})$$, the null space $$N(\bs{A})$$, the row space $$C(\bs{A}')$$, and the left null space $$N(\bs{A}')$$.

• $$\bs{U}_r$$ forms a basis of $$C(\bs{A})$$.
• $$\bar{\bs{U}}_r$$ forms a basis of $$N(\bs{A}')$$.
• $$\bs{V}_r$$ forms a basis of $$C(\bs{A}')$$.
• $$\bar{\bs{V}}_r$$ forms a basis of $$N(\bs{A}')$$.

See this post for detailed proof.

# Principle Component Analysis (PCA)

PCA is the most important matrix analysis tool. In this section we use $$\bs{X}=(X_1,X_2,\ldots,X_p)$$ to denote a vector of random variables. Being capitalized here means they are random variables rather than observations, and thus a capitalized bold symbol stands still for a vector.

## Three Basic Formulas (for Population Analysis)

Expectation:

$\E[\bs{AX}] = \bs{A}\E[\bs{X}].$

Variance:

$\Var[\bs{Ax}] = \bs{A}\Var[\bs{X}]\bs{A}'.$

Covariance:

$\Cov[\bs{a}'\bs{X},\bs{b}'\bs{X}] = \bs{a}'\Var[\bs{X}]\bs{b}.$

## Definition of Population PCA

Given $$\bs{X}:\Omega\to\R^p$$, find $$\bs{A}\in\R^{p\times p}$$ s.t.

• $$Y_1,Y_2,\ldots,Y_p$$ are uncorrelated, where $$\bs{Y}=(Y_1,Y_2,\ldots,Y_p)=\bs{A}\bs{X}$$.
• $$Y_1,Y_2,\ldots,Y_p$$ have variances as large as possible.

These two condisions can be described in equations as below:

• $$\Cov[Y_i,Y_j]=\bs{O}$$ for all $$i\neq j$$.
• $$\Var[Y_i]$$ is maximized for all $$i=1,2,\ldots,p$$ (under the restraints that $$\norm{\bs{a}_j}=1$$ for all $$j=1,2,\ldots,p$$). These $$Y_1,Y_2,\ldots,Y_p$$ are called principle components of $$\bs{X}$$.

## How to Calculate Population PCs

We can do it recursively:

• 1st PC: for $$Y_1=\bs{a}_1'\bs{X}$$, let $$\bs{a}_1=\arg\max\{\Var[\bs{a}_1'\bs{X}]\}$$ s.t. $$\norm{\bs{a}_1}=1$$.
• 2nd PC: for $$Y_2=\bs{a}_2'\bs{X}$$, let $$\bs{a}_2=\arg\max\{\Var[\bs{a}_2'\bs{X}]\}$$ s.t. $$\norm{\bs{a}_2}=1$$, $$\Cov[\bs{a}_1'\bs{X},\bs{a}_2'\bs{X}]=0$$.
• 3rd PC: for $$Y_3=\bs{a}_3'\bs{X}$$, let $$\bs{a}_3=\arg\max\{\Var[\bs{a}_3'\bs{X}]\}$$ s.t. $$\norm{\bs{a}_3}=1$$, $$\Cov[\bs{a}_1'\bs{X},\bs{a}_3'\bs{X}]=0$$ and $$\Cov[\bs{a}_2'\bs{X},\bs{a}_3'\bs{X}]=0$$.

Or, we can do it analytically by the theorem below:

Theorem Let $$\bs{\Sigma}=\Cov[\bs{X}]$$ and let the EVD be $$\bs{\Sigma} = \bs{A}\bs{\Lambda}\bs{A}'$$, then it can be proved that $$Y_k = \bs{a}_k'\bs{X}$$ is the $$k$$-th PC.

## Properties of Population PCs

• The total variance is not changed: $$\sum \Var[X_i]=\sum\Var[Y_i]$$.
• The proportion in variance of the $$k$$-th PC is $$\frac{\Var[Y_k]}{\sum \Var[Y_i]} = \frac{\lambda_k}{\sum \lambda_i}$$ where $$\lambda_i$$ is the $$i$$-th eigenvalue.
• The correlation $$\Corr[Y_i, X_j]=\sqrt{\lambda_i}a_{ij}/\sigma_j$$.

## Definition of Sample PCA

Given $$n$$ samples: $$\{X(\omega_1), X(\omega_2), \ldots, X(\omega_n)\}$$. Everything is just the same except being in sample notations, e.g. $$\bar{\bs{X}}=\bs{X}'\bs{1} / n$$ and $$S=(\bs{X}-\bar{\bs{X}}'\bs{1})'(\bs{X}-\bar{\bs{X}}'\bs{1}) / (n-1)$$.

## How to Calculate Sample PCA

In order to avoid loss of precision in calculating the $$\bs{S}$$, we do SVD instead of EVD, on the mean-centered sample matrix $$\bs{X}-\bs{1}\bar{\bs{X}}=\bs{U\Sigma V}'\in\R^{n\times p}$$. Then it can be proved that the $$k$$-th sample PC is given by $$\bs{v}_k$$, $$k=1,2,\ldots,p$$.

Remarks: In this case, we call $$\bs{V}$$ the loading matrix, and $$\bs{U\Sigma}:=\bs{T}$$ the score matrix. The PCA scatter plot is thereby the projection onto the PCs, i.e. the columns of $$\bs{V}$$. Specifically, the coordinates are gonna be $$\{(t_{ij}, t_{ik})\in\R^2: i=1,2,\ldots, n\}$$ namely the selected first columns of the score matrix. This is identical to manually projecting $$\bs{X}$$ onto selected columns of $$\bs{Q}$$ (the principle components) as it can be proved that $$t_{ij}=\bs{x}_i'\bs{q}_j$$. However, by using SVD we avoid miss-calculating the eigenvalues in low precision systems.

## Definition of Variable PCA

This is merely population PCA on $$\bs{X}'\in\R^{p\times n}$$. Transposing $$\bs{X}$$ swaps the roles of the number of variables and the size of population. The SVD now becomes

$\bs{X}' = \bs{V\Sigma U}'$

where we now instead call $$\bs{V\Sigma}$$ the $$\bs{T}$$ variable.

Remarks: By plotting $$\bs{X}$$ against $$\bs{V}$$ we get PCA scatter plot; by plotting $$\bs{X}$$ against $$\bs{U}$$ on the same piece of paper where draw this PCA scatter plot, we get the so-called biplot.

## Application of PCA: Sample Factor Analysis (FA)

One sentence to summarize it: sample factor analysis equals PCA. In formula, FA is trying to write $$\bs{X}\in\R^p$$ into

$\bs{X} = \bs{\mu} + \bs{LF} + \bs{\varepsilon}$

where $$\bs{\mu}\in\R^p$$ are $$p$$ means of features (aka. alphas), $$\bs{L}\in\R^{p\times m}$$ are loadings (aka. betas) and $$\bs{F}\in\R^m$$ are called the factors.

There are some assumptions:

• $$m\ll p$$ (describing a lot of features in few factors).
• $$\E[\bs{F}] = 0$$ (means are captured already by $$\bs{\mu}$$), $$\Cov[\bs{F}]=\bs{I}$$ (factors are uncorrelated).
• $$\E[\bs{\varepsilon}]=0$$ (residuals are zero-meaned), $$\Cov[\bs{\varepsilon}]=\Xi$$ is diagonal (residuals are uncorrelated).
• $$\Cov[\bs{\varepsilon},\bs{F}]=\E[\bs{\varepsilon F}']=\bs{O}$$ ($$\bs{\varepsilon}$$ is uncorrelated with $$\bs{F}$$).

With these assumptions we have $$\bs{\Sigma}=\bs{LL}'+\bs{\Xi}$$.

# Canonical Correlation Analysis (CCA)

Similar to PCA, we try to introduce CCA in two ways, namely w.r.t. a population and a sample.

## Notations for Population CCA

Assume $$p\le q$$ (important!!!). Given $$\bs{X}\in\R^p$$ and $$\bs{Y}\in\R^q$$, define

$\mu_{\bs{X}} = \E[\bs{X}]\in\R^p\quad\text{and}\quad \mu_{\bs{Y}} = \E[\bs{Y}]\in\R^q$

and

$\bs{\Sigma}_{\bs{X}} = \Cov[\bs{X}]\in\R^{p\times p}\quad\text{and}\quad \bs{\Sigma}_{\bs{Y}} = \Cov[\bs{Y}]\in\R^{q\times q}.$

Furthermore, define

$\bs{\Sigma}_{\bs{XY}} = \Cov[\bs{X},\bs{Y}]=\E[(\bs{X}-\mu_{\bs{X}})(\bs{Y}-\mu_{\bs{Y}})']\in\R^{p\times q}$

and

$\bs{\Sigma}_{\bs{YX}} = \Cov[\bs{Y},\bs{X}]=\E[(\bs{Y}-\mu_{\bs{Y}})(\bs{X}-\mu_{\bs{X}})']\in\R^{q\times p}.$

Also, let

$\bs{W} = \begin{bmatrix} \bs{X}\\\bs{Y} \end{bmatrix} \in \R^{p+q}.$

Then with this notation we know given $$\bs{a}\in\R^p$$ and $$\bs{b}\in\R^q$$, how expectation, variance and covariance (ans thus correlation) are represented for $$U=\bs{a}'\bs{X}$$ and $$V=\bs{b}'\bs{Y}$$.

## Definition of Population CCA

We calculate canonical correlation variables iteratively:

• $$(U_1,V_1)=\arg\max\{\Cov[U,V]:\Var[U]=\Var[V]=1\}$$.
• $$(U_2,V_2)=\arg\max\{\Cov[U,V]:\Var[U]=\Var[V]=1,\Cov[U,U_1]=\Cov[V,V_1]=\Cov[V,U_1]=\Cov[U,V_1]=0\}$$.
• $$(U_k,V_k)=\arg\max\{\Cov[U,V]:\Var[U]=\Var[V]=1,\Cov[U,U_i]=\Cov[V,V_i]=\Cov[V,U_i]=\Cov[U,V_i]=0\ \ \forall i=1,2,\ldots, k-1\}$$.

Theorem Suppose $$p\le q$$, let $$\Gamma_{\bs{XY}}=\bs{\Sigma}_{\bs{X}}^{-1/2}\bs{\Sigma}_{\bs{XY}}\bs{\Sigma}_{\bs{Y}}^{-1/2}\in\R^{p\times q}$$ and the condensed SVD be1

$\bs{\Gamma}_{\bs{XY}} = \begin{bmatrix} \bs{u}_1 & \bs{u}_2, & \ldots & \bs{u}_p \end{bmatrix}\begin{bmatrix} \sigma_1 & \cdots & \cdots & \bs{O} \\ \vdots & \sigma_2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \bs{O} & \cdots & \cdots & \sigma_p \end{bmatrix}\begin{bmatrix} \bs{v}_1'\\ \bs{v}_2' \\ \vdots \\ \bs{v}_p' \end{bmatrix}$

which gives

$\bs{U}_k = \bs{u}_k'\bs{\Sigma}_{\bs{X}}^{-1/2}\bs{X}\quad\text{and}\quad \bs{V}_k = \bs{v}_k'\bs{\Sigma}_{\bs{Y}}^{-1/2}\bs{Y}$

and that $$\rho_k=\sigma_k$$.

We call $$\rho_k=\Corr[U_k,V_k]$$ as the $$k$$-th population canonical correlation. We call $$\bs{a}_k=\bs{\Sigma}_{\bs{X}}^{-1/2}\bs{u}_k$$ and $$\bs{b}_k=\bs{\Sigma}_{\bs{Y}}^{-1/2}\bs{v}_k$$ as the population canonical vectors.

## Properties of Population CCA

The canonical correlation variables have the following three basic properties:

• $$\Cov[U_i,U_j]=\1{i=j}$$.
• $$\Cov[V_i,V_j]=\1{i=j}$$.
• $$\Cov[U_i,V_j]=\1{i=j} \sigma_i$$.

Theorem Let $$\tilde{\bs{X}}=\bs{MX}+\bs{c}$$ be the affine transformation of $$\bs{X}$$. Similarly let $$\tilde{\bs{Y}}=\bs{NY}+\bs{d}$$. Then by using CCA, the results from analyzing $$(\tilde{\bs{X}},\tilde{\bs{Y}})$$ remains unchanged as $$(\bs{X},\bs{Y})$$, namely CCA is affine invariant.

Based on this theorem we have the following properties:

• The canonical correlations between $$\tilde{\bs{X}}$$ and $$\tilde{\bs{Y}}$$ are identical to those between $$\bs{X}$$ and $$\bs{Y}$$.
• The canonical correlation vectors are not the same. They now becomes $$\tilde{\bs{a}_k}=(\bs{M}')^{-1}\bs{a}_k$$ and $$\tilde{\bs{b}_k}=(\bs{N}')^{-1}\bs{b}_k$$.
• By using covariances $$\bs{\Sigma}$$ or correlations $$\bs{P}$$ makes no difference in CCA2. This is not true for PCA, i.e. there is no simple relationship between PCA obtained from covariances and PCA from correlations.

## Example of Calculating Population CCA

Let’s assume $$p=q=2$$, let

$\bs{P}_{\bs{X}} = \begin{bmatrix} 1 & \alpha \\ \alpha & 1 \end{bmatrix},\quad \bs{P}_{\bs{Y}} = \begin{bmatrix} 1 & \gamma\\ \gamma & 1 \end{bmatrix}\quad\text{and}\quad \bs{P}_{\bs{XY}} = \begin{bmatrix} \beta & \beta \\ \beta & \beta \end{bmatrix}$

with $$|\alpha| < 1$$, $$|\gamma| < 1$$. In order to find the 1st canonical correction of $$\bs{X}$$ and $$\bs{Y}$$, we first check

$\det(\bs{P}_{\bs{X}}) = 1 - \alpha^2 > 0\quad\text{and}\quad \det(\bs{P}_{\bs{Y}}) = 1 - \gamma^2 > 0.$

Therefore, we may calculate

$\bs{H}_{\bs{XY}} = \bs{P}_{\bs{X}}^{-1}\bs{P}_{\bs{XY}}\bs{P}_{\bs{Y}}^{-1}\bs{P}_{\bs{XY}} = \frac{\beta}{1-\alpha^2}\begin{bmatrix} 1 & -\alpha \\ -\alpha & 1 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \frac{2\beta^2}{(1+\alpha)(1+\gamma)}\bs{1}\bs{1}'.$

It’s easy to show that $$\lambda_1= 2$$ and $$\lambda_2=0$$ are the two eigenvalues of $$\bs{11}'$$ and thus the 1st canonical correction is

$\rho_1 = \sqrt{\frac{4\beta^2}{(1+\alpha)(1+\gamma)}} = \frac{2\beta}{\sqrt{(1+\alpha)(1+\gamma)}}.$

## Notations for Sample CCA

Given $$n$$ samples: $$\bs{X}=\{\bs{X}(\omega_1), \bs{X}(\omega_2), \ldots, \bs{X}(\omega_n)\}\in\R^{n\times p}$$ and similarly $$\bs{Y}\in\R^{n\times q}$$. Everything is just the same except being in sample notations, e.g. $$\bar{\bs{X}}=\bs{X}'\bs{1} / n$$ and $$S_{\bs{X}}=(\bs{X}-\bar{\bs{X}}'\bs{1})'(\bs{X}-\bar{\bs{X}}'\bs{1}) / (n-1)$$.

Besides the regular notations, here we also define $$r_{\bs{XY}}(\bs{a},\bs{b})$$ as the sample correlation of $$\bs{a}'\bs{X}$$ and $$\bs{b}'\bs{Y}$$:

$r_{\bs{XY}}(\bs{a},\bs{b}) = \frac{\bs{a}'\bs{S}_{\bs{XY}}\bs{b}}{\sqrt{\bs{a}'\bs{S}_{\bs{X}}\bs{a}}\sqrt{\bs{b}'\bs{S}_{\bs{Y}}\bs{b}}}.$

## Definition of Sample CCA

Same as population CCA, we give sample CCA iteratively (except that here we’re talking about canonical correlation vectors directly):

• $$(\hat{\bs{a}}_1,\hat{\bs{b}}_1)=\arg\max\{\bs{a}'\bs{S}_{\bs{XY}}\bs{b}:\bs{a}'\bs{S}_{\bs{X}}\bs{a}=\bs{b}'\bs{S}_{\bs{Y}}\bs{b}=1\}$$.
• $$(\hat{\bs{a}}_2,\hat{\bs{b}}_2)=\arg\max\{\bs{a}'\bs{S}_{\bs{XY}}\bs{b}:\bs{a}'\bs{S}_{\bs{X}}\bs{a}=\bs{b}'\bs{S}_{\bs{Y}}\bs{b}=1,\bs{a}'\bs{S}_{X}\hat{\bs{a}}_1=\bs{b}'\bs{S}_{\bs{Y}}\hat{\bs{b}}_1=\bs{a}'\bs{S}_{\bs{XY}}\hat{\bs{b}}_1=\bs{b}'\bs{S}_{\bs{YX}}\hat{\bs{a}}_1=0\}$$.
• $$(\hat{\bs{a}}_k,\hat{\bs{b}}_k)=\arg\max\{\bs{a}'\bs{S}_{\bs{XY}}\bs{b}:\bs{a}'\bs{S}_{\bs{X}}\bs{a}=\bs{b}'\bs{S}_{\bs{Y}}\bs{b}=1,\bs{a}'\bs{S}_{X}\hat{\bs{a}}_i=\bs{b}'\bs{S}_{\bs{Y}}\hat{\bs{b}}_i=\bs{a}'\bs{S}_{\bs{XY}}\hat{\bs{b}}_i=\bs{b}'\bs{S}_{\bs{YX}}\hat{\bs{a}}_i=0\}\ \ \forall i=1,2,\ldots,k-1$$.

Theorem Given $$p\le q$$, let $$\bs{G}_{\bs{XY}} = \bs{S}_{\bs{X}}^{-1/2}\bs{S}_{\bs{XY}}\bs{S}_{\bs{Y}}^{-1/2}\in\R^{p\times q}$$ and the SVD be

$\bs{G}_{\bs{XY}} = \begin{bmatrix} \bs{u}_1 & \bs{u}_2 & \cdots & \bs{u}_p \end{bmatrix}\begin{bmatrix} \sigma_1 & \cdots & \cdots & \bs{O} \\ \vdots & \sigma_2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \bs{O} & \cdots & \cdots & \sigma_p \end{bmatrix}\begin{bmatrix} \bs{v}_1' \\ \bs{v}_2' \\ \vdots \\ \bs{v}_p' \end{bmatrix}.$

Then we have

$\hat{\bs{a}}_k = \bs{S}_{\bs{X}}^{-1/2} \bs{u}_k\in\R^p\quad \text{and}\quad \hat{\bs{b}}_k = \bs{S}_{\bs{Y}}^{-1/2} \bs{v}_k\in\R^q.$

$r_k = r_{\bs{XY}}(\hat{\bs{a}}_k, \hat{\bs{b}}_k) = \sigma_k.$

We call $$r_k$$ the $$k$$-th sample canonical correlation. Also, $$\bs{Xa}_k$$ and $$\bs{Yb}_k$$ are called the score vectors.

## Properties of Sample CCA

Everything with population CCA, including the affine invariance, holds with sample CCA, too.

## Example of Calculating Sample CCA

Let $$p=q=2$$, let

$\bs{X}=\begin{bmatrix} \text{head length of son1}\\ \text{head breath of son1} \end{bmatrix}=\begin{bmatrix} l_1\\b_1 \end{bmatrix}\quad\text{and}\quad \bs{Y}=\begin{bmatrix} \text{head length of son2}\\ \text{head breath of son2} \end{bmatrix}=\begin{bmatrix} l_2\\b_2 \end{bmatrix}.$

Assume there’re $$25$$ families, each having $$2$$ sons. The $$\bs{W}$$ matrix is therefore

$\bs{W} = \begin{bmatrix} \bs{l}_1 & \bs{b}_1 & \bs{l}_2 & \bs{b}_2 \end{bmatrix}\in\R^{25\times 4}.$

In addition, we may also calculate the correlations $$\bs{R}_{\bs{X}}$$, $$\bs{R}_{\bs{Y}}$$ and $$\bs{R}_{\bs{XY}}$$, from which we have $$\bs{G}_{\bs{XY}}$$ given by

$\bs{G}_{\bs{XY}} = \bs{R}_{\bs{X}}^{-1/2}\bs{R}_{\bs{XY}}\bs{R}_{\bs{Y}}^{-1/2}$

which gives the sample canonical correlations $$r_1$$ and $$r_2$$ (in most cases $$r_1\ll r_2$$). In the meantime, we have $$\bs{u}_{1,2}$$ and $$\bs{v}_{1,2}$$ and because of significant difference in scale of $$r_1$$ and $$r_2$$, we don’t care about $$\bs{u}_2$$ and $$\bs{v}_2$$. From $$\bs{u}_1$$ and $$\bs{v}_1$$ we can calculate $$\hat{\bs{a}}_1$$ and $$\hat{\bs{b}}_2$$, which indicate the linear relationship between the features. Specifically, the high correlation $$r_1$$ tells that $$\hat{U}_1=\hat{\bs{a}}_1\bs{X}$$ and $$\hat{V}_1=\hat{\bs{b}}_1\bs{Y}$$, which are essentially the “girths” (height $$+$$ breath) of sons’ faces, are highly correlated. In comparison, data shows that $$\hat{U}_2$$ and $$\hat{V}_2$$, which describes the “shapes” (height $$-$$ breath) of sons’ faces, are poorly correlated.

# Linear Discriminant Analysis (LDA)

Different from the unsupervised learning algorithms we’ve been discussing in the previous sections, like PCA, FA and CCA, LDA is a supervised classification method in that the number of classes to be divided into is specified explicitly.

## Notations for LDA

Assume $$g$$ different classes $$C_1,C_2,\ldots,C_g$$. We try to solve the problem asking for the optimal division of $$n$$ rows of $$\bs{X}\in\R^{n\times p}$$ into $$g$$ parts: for each $$i=1,2,\ldots,g$$ denote

$\bs{X}_i = \begin{bmatrix} \bs{x}_{i,1}'\\ \bs{x}_{i,2}'\\ \vdots\\ \bs{x}_{i,n_i}' \end{bmatrix}\in\R^{n_i\times p}$

with $$\sum_{i=1}^g n_i=g$$, then given $$\bs{a}\in\R^p$$ we can define

$\bs{X}_i\bs{a} =: \bs{y}_i \in \R^{n_i},\quad i=1,2,\ldots,g.$

Now, recall the mean-centering matrix

$\bs{H} = \bs{I} - \frac{\bs{\iota}\bs{\iota}'}{\bs{\iota}'\bs{\iota}}$

which provides handy feature that centers a vector by its mean: for any $$\bs{X}$$

$\bs{HX} = \bs{X} - \bs{\iota}'\bar{\bs{X}}.$

With $$\bs{H}$$ we also have the total sum-of-squares as

$\bs{T} = \bs{X}'\bs{HX} = \bs{X}'\left(\bs{I}-\frac{\bs{\iota}\bs{\iota}'}{\bs{\iota}'\bs{\iota}}\right)\bs{X} = (n-1)\bs{S}.$

Similarly we define $$\bs{H}_i$$ and $$\bs{W}_i=\bs{X}_i'\bs{H}_i\bs{X}_i$$ for each $$i=1,2,\ldots,g$$ and thus

$\sum_{i=1}^g \bs{y}_i'\bs{H}_i\bs{y}_i = \sum_{i=1}^g \bs{a}'\bs{X}_i'\bs{H}_i\bs{X}_i\bs{a} = \bs{a}' \left(\sum_{i=1}^g \bs{X}_i'\bs{H}_i\bs{X}_i\right) \bs{a} = \bs{a}'\sum_{i=1}^g \bs{W}_i\bs{a} =:\bs{a}'\bs{W}\bs{a}$

where $$\bs{W}\in\R^{p\times p}$$ is known as the within-group sum-of-squares. Finally, check

\begin{align} \sum_{i=1}^g n_i (\bar{\bs{y}}_i-\bar{\bs{y}})^2 &= \sum_{i=1}^g n_i (\bar{\bs{X}}_i\bs{a}-\bar{\bs{X}}\bs{a})^2 = \sum_{i=1}^g n_i (\bar{\bs{X}}_i\bs{a}-\bar{\bs{X}}\bs{a})'(\bar{\bs{X}}_i\bs{a}-\bar{\bs{X}}\bs{a}) \\&= \bs{a}' \left[\sum_{i=1}^g n_i (\bar{\bs{X}}_i - \bar{\bs{X}})'(\bar{\bs{X}}_i - \bar{\bs{X}})\right] \bs{a} =: \bs{a}'\bs{B}\bs{a} \end{align}

where we define $$\bs{B}\in\R^{p\times p}$$ as the between-group sum-of-squares.

Theorem For any $$\bs{a}\in\R^p$$, $$\bs{a}'\bs{T}\bs{a} = \bs{a}'\bs{B}\bs{a} + \bs{a}'\bs{W}\bs{a}$$. However, $$\bs{T}\neq \bs{B}+\bs{W}$$.

## Definition of LDA

First let’s give the Fisher Linear Discriminant Function

$f(\bs{a}) = \frac{\bs{a}'\bs{B}\bs{a}}{\bs{a}'\bs{W}\bs{a}}$

by maximizing which, Fisher states, gives the optimal classification.

Theorem Suppose $$\bs{W}\in\R^{p\times p}$$ is nonsingular, let $$\bs{q}_1\in\R^p$$ be the principle eigenvector of $$\bs{W}^{-1}\bs{B}$$ corresponding to $$\lambda_1=\lambda_{\max}(\bs{W}^{-1}\bs{B})$$, then it can be shown that $$\bs{q}_1=\arg\max_{\bs{a}}f(\bs{a})$$ and $$\lambda_1=\max_{\bs{a}}f(\bs{a})$$.

Proof. We know

$\max_{\bs{a}} \frac{\bs{a}'\bs{B}\bs{a}}{\bs{a}'\bs{W}\bs{a}} = \max_{\bs{a}'\bs{W}\bs{a}=1} \bs{a}'\bs{B}\bs{a} = \max_{\bs{b}'\bs{b}=1} \bs{b}'\bs{W}^{-1/2}\bs{B}\bs{W}^{-1/2}\bs{b} = \lambda_{\max}(\bs{W}^{-1/2}\bs{B}\bs{W}^{-1/2}) = \lambda_{\max}(\bs{W}^{-1}\bs{B}).$

Therefore, we have the maximum $$\lambda_1$$ and thereby we find the maxima $$\bs{q}_1$$.Q.E.D.

Remark: $$\lambda_1$$ and $$\bs{q}_1$$ can also be seen as the solutions to

$\bs{B}\bs{x} = \lambda \bs{W}\bs{x},\quad \bs{x}\neq\bs{0}$

which is known as a generalized eigenvalue problem since it reduces to the usual case when $$\bs{W}=\bs{I}$$.

## Classification Rule of LDA

Given a new data point $$\bs{t}\in\R^p$$ we find

$i = \underset{j=1,2,\ldots,g}{\arg\min} |\bs{q}_1'(\bs{t}-\bar{\bs{X}_j})|$

and assign $$\bs{t}$$ to class $$C_j$$. This essentially the heuristic behind Fisher’s LDA.

There is no general formulae for $$g$$-class LDA, but we do have one for specifically $$g=2$$.

## Population LDA for Binary Classification

The population linear discriminant function is

$f(\bs{a}) = \frac{\sigma_{\text{between}}}{\sigma_{\text{within}}} = \frac{(\bs{a}'\bs{\mu}_1 - \bs{a}'\bs{\mu}-2)^2}{\bs{a}'\bs{\Sigma}_1\bs{a} + \bs{a}'\bs{\Sigma}_2\bs{a}} = \frac{[\bs{a}'(\bs{\mu}_1-\bs{\mu}_2)]^2}{\bs{a}'(\bs{\Sigma}_1+\bs{\Sigma}_2)\bs{a}}$

which solves to

$\bs{q}_1 = (\bs{\Sigma}_1 + \bs{\Sigma}_2)^{-1}(\bs{\mu}_1-\bs{\mu}_2) := \bs{\Sigma}^{-1}(\bs{\mu}_1-\bs{\mu}_2)$

and the threshold

$c=\frac{1}{2}(\bs{\mu}_1'\bs{\Sigma}^{-1}\bs{\mu}_1 - \bs{\mu}_2'\bs{\Sigma}^{-1}\bs{\mu}_2).$

The classification is therefore given by

$C(\bs{t}) = \begin{cases} C_1\ & \text{if}\quad\bs{q}_1'\bs{t} > c,\\ C_2\ & \text{if}\quad\bs{q}_1'\bs{t} < c.\\ \end{cases}$

# MDS and CA

Variable matrix decomposition methods provide different classes of data analysis tools:

• Based on SVD we have PCA, FA, HITS and LSI for general $$\bs{A}\in\R^{n\times p}$$.
• Based on EVD we have CCA and MDS for $$\bs{A}\in\R^{p\times p}$$, symmetric.
• Based on generalized EVD (GEVD) we have LDA for $$\bs{A},\bs{B}\in\R^{p\times p}$$, both symmetric.
• Finally, based on generalized SVD (GSVD) we have CA for $$\bs{A}=\bs{U\Sigma}\bs{V}'\in\R^{n\times p}$$ where we instead of orthonormality of $$\bs{U}$$ and $$\bs{V}$$ assume $$\bs{U}'\bs{D}_1\bs{U}=\bs{I}$$ and $$\bs{V}'\bs{D}_2\bs{V}=\bs{I}$$ for diagonal $$\bs{D}_1\in\R^{n\times n}$$ and $$\bs{D}_2\in\R^{p\times p}$$.

1. Every symmetric positive definite matrix has a square root given by $$\bs{A}^{1/2} = \bs{Q}\bs{\Lambda}^{1/2}\bs{Q}'$$ where $$\bs{A}=\bs{Q\Lambda Q}'$$ is the EVD of $$\bs{A}$$. ↩︎
2. In fact the canonical correlation vectors are scaled by $$\bs{V}^{-1/2}_{\bs{X}}$$ and $$\bs{V}^{-1/2}_{\bs{Y}}$$ respectively. ↩︎