Principal Component Analysis¶
This document is based on: Principal Component Analysis from Scratch
There are some differences:
SVD of the covariance matrix is used instead of the eigendecomposition. As a result, the eigenvalue and eigenvector pairs are always returned in non-ascending order of eigenvalue.
An additional eigenvector sign correction is included as an optional step; it is needed to obtain results identical to scikit-learn.
What is PCA?¶
Greenacre, M., Groenen, P.J.F., Hastie, T. et al. Principal component analysis. Nat Rev Methods Primers 2, 100 (2022):
PCA reduces a cases-by-variables data table to its essential features, called principal components.
Principal components are a few linear combinations of the original variables that maximally explain the variance of all the variables.
The method provides an approximation of the original data table using only these few major components.
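To make these steps concrete before the walkthrough, the cell below is a minimal sketch of the whole pipeline (standardize, build the covariance matrix, decompose it, project onto the leading components), written against a generic data matrix of shape (n_samples, n_features); pca_sketch and X_toy are illustrative names, not part of the original notebook.
import numpy as np
def pca_sketch(X, k):
    """Minimal PCA sketch: standardize, covariance, SVD, project onto the first k components."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each feature
    cov = np.cov(X_std.T)                          # covariance matrix of the features
    U, S, Vt = np.linalg.svd(cov)                  # S: eigenvalues in non-ascending order, U: eigenvectors
    W = U[:, :k]                                   # keep the first k eigenvectors as the projection matrix
    return X_std @ W, S / S.sum()                  # projected data and explained-variance ratios
# Toy usage with random data (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
X_proj_toy, ratios = pca_sketch(X_toy, k=2)
print(X_proj_toy.shape, ratios)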
Preparations¶
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams.update({
"text.usetex": False,
"font.family": "sans-serif",
"font.size": 8,
})
np.set_printoptions(precision=3, suppress=True)
Iris Dataset¶
The iris dataset is ideal for our learning purposes since it has only 4 features: sepal-length, sepal-width, petal-length, and petal-width.
More information about the iris plant can be found at this Wikipedia link.
Load the Dataset¶
data = np.loadtxt("./datasets/iris/iris.data", delimiter=",", dtype=str)
X = np.float32(data[:, 0:4]) # 4 features
# Change the labels: string labels to numerical labels:
# Iris-setosa = 0
# Iris-versicolor = 1
# Iris-virginica = 2
y = np.zeros(X.shape[0], dtype=np.int32)
y[np.where(data[:,-1]=='Iris-setosa')] = 0
y[np.where(data[:,-1]=='Iris-versicolor')] = 1
y[np.where(data[:,-1]=='Iris-virginica')] = 2
feature_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
n_features = X.shape[1]
print("Feature numbers :", X.shape[1])
print("Number of data :", X.shape[0])
print("Labels :", np.unique(y))
Feature numbers : 4
Number of data : 150
Labels : [0 1 2]
%matplotlib inline
fig, ax = plt.subplots(nrows=n_features, ncols=n_features, figsize= (n_features*2, n_features*2))
fig.tight_layout(pad=2.0)
names = feature_names
for i in range(n_features):
J = np.arange(n_features)
J = np.delete(J, i)
for k, j in enumerate(J):
ax[i, k].scatter(X[:, i], X[:, j], c = y)
ax[i, k].set_xlabel(names[i])
ax[i, k].set_ylabel(names[j])
for i in range(n_features):
ax[i, -1].hist(X[:, i], color = 'lightblue')
ax[i, -1].set_ylabel('Count')
ax[i, -1].set_xlabel(names[i])
Standardization¶
Definition: \(x_{\text{std}} = \dfrac{x - \mu}{\sigma}\), where \(\mu\) is the mean and \(\sigma\) the standard deviation of each feature.
means = np.mean(X, axis=0)
stdevs = np.std(X, axis=0)
X_std = (X - means) / stdevs
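As a quick sanity check that the standardization behaves as defined (a sketch reusing X_std from the cell above), each feature should now have a mean of approximately 0 and a standard deviation of approximately 1; the tolerance is loose because X was loaded as float32.
# Each standardized feature should have mean ~0 and standard deviation ~1
print(np.allclose(X_std.mean(axis=0), 0, atol=1e-4))   # expected: True
print(np.allclose(X_std.std(axis=0), 1, atol=1e-4))    # expected: True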
Let us plot the standardized dataset.
%matplotlib inline
fig, ax = plt.subplots(nrows=n_features, ncols=n_features, figsize= (n_features*2, n_features*2))
fig.tight_layout(pad=2.0)
names = feature_names
for i in range(n_features):
J = np.arange(n_features)
J = np.delete(J, i)
for k, j in enumerate(J):
ax[i, k].scatter(X_std[:, i], X_std[:, j], c = y)
ax[i, k].set_xlabel(names[i])
ax[i, k].set_ylabel(names[j])
for i in range(n_features):
ax[i, -1].hist(X_std[:, i], color = 'lightblue')
ax[i, -1].set_ylabel('Count')
ax[i, -1].set_xlabel(names[i])
Covariance Matrix¶
Definition: \(\operatorname{cov}(X_\text{std}) = \frac{1}{n-1}(X_\text{std} - \bar{X})^{T}(X_\text{std} - \bar{X})\),
where each entry is \(\sigma_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_{j}\right)\left(x_{ik} - \bar{x}_{k}\right)\)
and \(\bar{x}_{j} = \frac{1}{n}\sum_{i=1}^{n} x_{ij}\) is the mean of feature \(j\) over the \(n\) samples.
cov_mat = np.cov(X_std.T) # square symmetric, positive semi-definite
cov_mat
array([[ 1.007, -0.11 , 0.878, 0.823],
[-0.11 , 1.007, -0.423, -0.359],
[ 0.878, -0.423, 1.007, 0.969],
[ 0.823, -0.359, 0.969, 1.007]])
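As a check of the definition above (a sketch reusing X_std and cov_mat from the cells above), the same matrix can be built directly from the centred data; the cast to float64 is only there to make the comparison with np.cov clean.
# Build the covariance matrix directly from the definition and compare with np.cov
n = X_std.shape[0]
Xc = np.float64(X_std) - np.float64(X_std).mean(axis=0)   # centre each feature
cov_manual = Xc.T @ Xc / (n - 1)
print(np.allclose(cov_manual, cov_mat))   # expected: True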
Singular Value Decomposition¶
Definition: \(\operatorname{cov}(X_\text{std}) = U \Sigma V^{T}\), where \(U\) and \(V\) are orthogonal matrices and \(\Sigma\) is a diagonal matrix whose entries (the singular values) are in non-ascending order.
For a square, symmetric, positive semi-definite matrix such as the covariance matrix, the eigenvalues and \(\operatorname{diag}(\Sigma)\) are exactly the same and \(U\) is the eigenvector matrix.
U, S, V = np.linalg.svd(cov_mat)
eig_vals = S
eig_vecs = U
print ("Eigen values (lambda):")
print(S)
print()
print("Corresponding Eigen-vector matrix (U):")
print(U)
Eigen values (lambda):
[2.93 0.927 0.148 0.021]
Corresponding Eigen-vector matrix (U):
[[-0.522 -0.372 0.721 0.262]
[ 0.263 -0.926 -0.242 -0.124]
[-0.581 -0.021 -0.141 -0.801]
[-0.566 -0.065 -0.634 0.524]]
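As a sanity check on the claim above (a sketch reusing cov_mat, S and U), np.linalg.eigh gives the same eigenpairs, just in ascending order and with possibly flipped signs:
# Eigendecomposition for comparison; eigh returns eigenvalues in ascending order
evals, evecs = np.linalg.eigh(cov_mat)
evals, evecs = evals[::-1], evecs[:, ::-1]       # reorder to non-ascending, like the SVD
print(np.allclose(evals, S))                     # eigenvalues match the singular values
print(np.allclose(np.abs(evecs), np.abs(U)))     # eigenvectors match up to sign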
Correct the Eigen-vector Matrix¶
This step is optional; it is only needed if we want our results to be identical to those of scikit-learn's PCA.
Correct the eigenvector matrix such that:
for each column, find the entry with the largest absolute value
if that entry is negative, make it positive by multiplying the entire column by -1
# Flip each eigenvector's sign so that its loading with the largest absolute value is positive
max_abs_idx = np.argmax(np.abs(eig_vecs), axis=0)
signs = np.sign(eig_vecs[max_abs_idx, range(eig_vecs.shape[0])])
eig_vecs_ = eig_vecs*signs[np.newaxis,:]
print(eig_vals)
print()
print(eig_vecs_)
[2.93 0.927 0.148 0.021]
[[ 0.522 0.372 0.721 -0.262]
[-0.263 0.926 -0.242 0.124]
[ 0.581 0.021 -0.141 0.801]
[ 0.566 0.065 -0.634 -0.524]]
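Since this correction is repeated verbatim for the second dataset below, it can also be wrapped in a small helper; flip_signs is a hypothetical name, not part of the original notebook.
def flip_signs(vecs):
    """Flip each column so that its entry with the largest absolute value is positive."""
    max_abs_idx = np.argmax(np.abs(vecs), axis=0)
    signs = np.sign(vecs[max_abs_idx, range(vecs.shape[1])])
    return vecs * signs[np.newaxis, :]
print(np.allclose(flip_signs(eig_vecs), eig_vecs_))   # expected: True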
Select the Axis Components¶
Let us introduce \(W\) as the matrix of new axis components, i.e. the first \(k\) columns of \(U\): \(W = (U_{i,j})_{1\leq j \leq k}\).
As an example, let us take \(k=2\).
# Select top k eigenvectors
k = 2
W = eig_vecs_[:, :k] # Projection matrix
print(W)
print(W.shape)
[[ 0.522 0.372]
[-0.263 0.926]
[ 0.581 0.021]
[ 0.566 0.065]]
(4, 2)
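Because the columns of \(W\) are taken from the orthogonal matrix \(U\), they remain orthonormal after the sign correction; a quick check, reusing W and k from above:
# The projection matrix has orthonormal columns: W^T W should be the k x k identity
print(np.allclose(W.T @ W, np.eye(k)))   # expected: True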
Project the Dataset to the New Axis Components¶
Let us introduce \(X_\text{proj} = X_\text{std} W\) to represent the transformed dataset:
X_proj = X_std.dot(W)
print(X_proj.shape)
(150, 2)
%matplotlib inline
fig, ax = plt.subplots(figsize=(3,3))
ax.scatter(X_proj[:, 0], X_proj[:, 1], c = y)
ax.set_xlabel('PC1'); plt.xticks([])
ax.set_ylabel('PC2'); plt.yticks([])
ax.set_title('2 components');
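Since PCA "provides an approximation of the original data table", we can also map the projected points back (a sketch reusing X_std, X_proj and W): multiplying the projection by \(W^{T}\) gives a rank-\(k\) approximation of the standardized data, and the relative error shows how much is lost by keeping only 2 components.
# Reconstruct the standardized data from the k = 2 components and measure the error
X_rec = X_proj @ W.T
rel_error = np.linalg.norm(X_std - X_rec) / np.linalg.norm(X_std)
print("Relative reconstruction error: {:.3f}".format(rel_error))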
Explained Variance¶
The optimal value for k can be tuned by investigating the explained variance.
The explained variance tells us how much information (variance) can be attributed to each of the principal components.
The (cumulative) explained variance of the first \(r\) components is \(\dfrac{\sum_{i=1}^{r}\lambda_{i}}{\sum_{j=1}^{m}\lambda_{j}}\), where \(r \leq m\) and \(m\) is the total number of components.
%matplotlib inline
eig_vals_total = sum(eig_vals)
explained_variance = [(i / eig_vals_total)*100 for i in eig_vals]
explained_variance = np.round(explained_variance, 2)
cum_explained_variance = np.cumsum(explained_variance)
print('Explained variance: {}'.format(explained_variance))
print('Cumulative explained variance: {}'.format(cum_explained_variance))
fig, ax = plt.subplots(figsize= (3, 3))
ax.plot(np.arange(1,n_features+1), cum_explained_variance, '-o')
ax.set_xticks(np.arange(1,n_features+1))
ax.set_xlabel('Number of components')
ax.set_ylabel('Cumulative explained variance');
Explained variance: [72.77 23.03 3.68 0.52]
Cumulative explained variance: [ 72.77 95.8 99.48 100. ]
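One common way to tune \(k\) (an added heuristic, not part of the original walkthrough) is to pick the smallest number of components whose cumulative explained variance reaches a chosen threshold; the 95% value below is an arbitrary choice.
# Smallest k whose cumulative explained variance reaches the threshold
threshold = 95.0
k_opt = int(np.argmax(cum_explained_variance >= threshold)) + 1
print("Smallest k reaching {}% of the variance: {}".format(threshold, k_opt))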
[OPTIONAL] Comparison to scikit-learn¶
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit_transform(X_std)
print('by sklearn:')
print(pca.components_.transpose())
print('ours:')
print(W)
by sklearn:
[[ 0.522 0.372]
[-0.263 0.926]
[ 0.581 0.021]
[ 0.566 0.065]]
ours:
[[ 0.522 0.372]
[-0.263 0.926]
[ 0.581 0.021]
[ 0.566 0.065]]
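The projected coordinates can be compared as well (a sketch reusing pca, X_std and X_proj from above); since X_std is already centred, scikit-learn's transform should agree with our projection up to small floating-point differences.
# Compare the transformed coordinates, not just the loadings
print(np.allclose(pca.transform(X_std), X_proj, atol=1e-4))   # expected: True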
Breast Cancer Wisconsin¶
In this section, we repeat exactly the same process with a different dataset, the Breast Cancer Wisconsin dataset. The dataset can be found at this link.
Load the Dataset¶
data = np.loadtxt("./datasets/breast_cancer/wdbc.data", delimiter=",", dtype=str)
X = np.float32(data[:, 2:12]) # 10 dimensions
# Diagnosis (M = malignant, B = benign)
y = np.zeros(X.shape[0], dtype=np.int32)
y[np.where(data[:,1]=='M')] = 1
y[np.where(data[:,1]=='B')] = 0
n_features = X.shape[1]
feature_names = ['#'+str(k) for k in range(n_features)]
print("Dimension numbers :", X.shape[1])
print("Number of data :", X.shape[0])
print("Labels :", np.unique(y))
Dimension numbers : 10
Number of data : 569
Labels : [0 1]
%matplotlib inline
fig, ax = plt.subplots(nrows=n_features, ncols=n_features, figsize= (n_features*2, n_features*2))
fig.tight_layout(pad=2.0)
names = feature_names
for i in range(n_features):
J = np.arange(n_features)
J = np.delete(J, i)
for k, j in enumerate(J):
ax[i, k].scatter(X[:, i], X[:, j], c = y)
ax[i, k].set_xlabel(names[i])
ax[i, k].set_ylabel(names[j])
for i in range(n_features):
ax[i, -1].hist(X[:, i], color = 'lightblue')
ax[i, -1].set_ylabel('Count')
ax[i, -1].set_xlabel(names[i])
Standardization¶
means = np.mean(X, axis=0)
stdevs = np.std(X, axis=0)
X_std = (X - means) / stdevs
%matplotlib inline
fig, ax = plt.subplots(nrows=n_features, ncols=n_features, figsize= (n_features*2, n_features*2))
fig.tight_layout(pad=2.0)
names = feature_names
for i in range(n_features):
J = np.arange(n_features)
J = np.delete(J, i)
for k, j in enumerate(J):
ax[i, k].scatter(X_std[:, i], X_std[:, j], c = y)
ax[i, k].set_xlabel(names[i])
ax[i, k].set_ylabel(names[j])
for i in range(n_features):
ax[i, -1].hist(X_std[:, i], color = 'lightblue')
ax[i, -1].set_ylabel('Count')
ax[i, -1].set_xlabel(names[i])
Covariance Matrix¶
cov_mat = np.cov(X_std.T) # square symmetric, positive semi-definite
print(cov_mat)
[[ 1.002 0.324 1. 0.989 0.171 0.507 0.678 0.824 0.148 -0.312]
[ 0.324 1.002 0.33 0.322 -0.023 0.237 0.303 0.294 0.072 -0.077]
[ 1. 0.33 1.002 0.988 0.208 0.558 0.717 0.852 0.183 -0.262]
[ 0.989 0.322 0.988 1.002 0.177 0.499 0.687 0.825 0.152 -0.284]
[ 0.171 -0.023 0.208 0.177 1.002 0.66 0.523 0.555 0.559 0.586]
[ 0.507 0.237 0.558 0.499 0.66 1.002 0.885 0.833 0.604 0.566]
[ 0.678 0.303 0.717 0.687 0.523 0.885 1.002 0.923 0.502 0.337]
[ 0.824 0.294 0.852 0.825 0.555 0.833 0.923 1.002 0.463 0.167]
[ 0.148 0.072 0.183 0.152 0.559 0.604 0.502 0.463 1.002 0.481]
[-0.312 -0.077 -0.262 -0.284 0.586 0.566 0.337 0.167 0.481 1.002]]
Singular Value Decomposition¶
U, S, V = np.linalg.svd(cov_mat)
eig_vals = S
eig_vecs = U
print ("Eigen values (lambda):")
print(S)
print()
print("Corresponding Eigen-vector matrix (U):")
print(U)
Eigen values (lambda):
[5.488 2.523 0.882 0.5 0.373 0.124 0.08 0.035 0.011 0. ]
Corresponding Eigen-vector matrix (U):
[[-0.364 0.314 -0.124 -0.03 0.031 0.264 0.044 -0.085 -0.474 0.669]
[-0.154 0.147 0.951 -0.009 0.22 0.032 -0.021 0.007 -0.004 -0. ]
[-0.376 0.285 -0.114 -0.013 0.006 0.238 0.083 -0.089 -0.38 -0.74 ]
[-0.364 0.305 -0.123 -0.013 0.019 0.332 -0.261 -0.145 0.747 0.032]
[-0.232 -0.402 -0.167 0.108 0.844 -0.062 -0.011 -0.171 -0.006 -0.004]
[-0.364 -0.266 0.058 0.186 -0.24 -0.005 0.804 -0.064 0.219 0.053]
[-0.396 -0.104 0.041 0.167 -0.313 -0.601 -0.367 -0.45 -0.081 0.01 ]
[-0.418 -0.007 -0.069 0.073 0.009 -0.266 -0.141 0.851 0.022 0.004]
[-0.215 -0.368 0.037 -0.893 -0.113 0.062 -0.048 -0.016 -0.009 -0.001]
[-0.072 -0.572 0.114 0.349 -0.265 0.568 -0.345 0.065 -0.13 -0.007]]
Correct the Eigen-vector Matrix¶
# Flip each eigenvector's sign so that its loading with the largest absolute value is positive
max_abs_idx = np.argmax(np.abs(eig_vecs), axis=0)
signs = np.sign(eig_vecs[max_abs_idx, range(eig_vecs.shape[0])])
eig_vecs_ = eig_vecs*signs[np.newaxis,:]
print(eig_vals)
print()
print(eig_vecs_)
[5.488 2.523 0.882 0.5 0.373 0.124 0.08 0.035 0.011 0. ]
[[ 0.364 -0.314 -0.124 0.03 0.031 -0.264 0.044 -0.085 -0.474 -0.669]
[ 0.154 -0.147 0.951 0.009 0.22 -0.032 -0.021 0.007 -0.004 0. ]
[ 0.376 -0.285 -0.114 0.013 0.006 -0.238 0.083 -0.089 -0.38 0.74 ]
[ 0.364 -0.305 -0.123 0.013 0.019 -0.332 -0.261 -0.145 0.747 -0.032]
[ 0.232 0.402 -0.167 -0.108 0.844 0.062 -0.011 -0.171 -0.006 0.004]
[ 0.364 0.266 0.058 -0.186 -0.24 0.005 0.804 -0.064 0.219 -0.053]
[ 0.396 0.104 0.041 -0.167 -0.313 0.601 -0.367 -0.45 -0.081 -0.01 ]
[ 0.418 0.007 -0.069 -0.073 0.009 0.266 -0.141 0.851 0.022 -0.004]
[ 0.215 0.368 0.037 0.893 -0.113 -0.062 -0.048 -0.016 -0.009 0.001]
[ 0.072 0.572 0.114 -0.349 -0.265 -0.568 -0.345 0.065 -0.13 0.007]]
Select the Axis Components¶
# Select top k eigenvectors
k = 3
W = eig_vecs_[:, :k] # Projection matrix
print(W)
print(W.shape)
[[ 0.364 -0.314 -0.124]
[ 0.154 -0.147 0.951]
[ 0.376 -0.285 -0.114]
[ 0.364 -0.305 -0.123]
[ 0.232 0.402 -0.167]
[ 0.364 0.266 0.058]
[ 0.396 0.104 0.041]
[ 0.418 0.007 -0.069]
[ 0.215 0.368 0.037]
[ 0.072 0.572 0.114]]
(10, 3)
Project the Dataset to the New Axis Components¶
X_proj = X_std.dot(W)
print(X_proj.shape)
(569, 3)
%matplotlib inline
fig = plt.figure(figsize=(4,4));
ax = fig.add_subplot(projection='3d');
ax.scatter(X_proj[:, 0], X_proj[:, 1], X_proj[:, 2], c = y);
ax.set_xlabel('PC1');
ax.set_xticks([]);
ax.set_ylabel('PC2', rotation=90);
ax.set_yticks([]);
ax.set_zlabel('PC3', rotation=90);
ax.set_zticks([]);
Explained Variance¶
%matplotlib inline
eig_vals_total = sum(eig_vals)
explained_variance = [(i / eig_vals_total)*100 for i in eig_vals]
explained_variance = np.round(explained_variance, 2)
cum_explained_variance = np.cumsum(explained_variance)
print('Explained variance: {}'.format(explained_variance))
print('Cumulative explained variance: {}'.format(cum_explained_variance))
fig, ax = plt.subplots(figsize= (3, 3))
ax.plot(np.arange(1,n_features+1), cum_explained_variance, '-o')
ax.set_xticks(np.arange(1,n_features+1))
ax.set_xlabel('Number of components')
ax.set_ylabel('Cumulative explained variance');
Explained variance: [54.79 25.19 8.81 4.99 3.73 1.24 0.8 0.35 0.11 0. ]
Cumulative explained variance: [ 54.79 79.98 88.79 93.78 97.51 98.75 99.55 99.9 100.01 100.01]
[OPTIONAL] Comparison to scikit-learn¶
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit_transform(X_std)
print('by sklearn:')
print(pca.components_.transpose())
print('ours:')
print(W)
by sklearn:
[[ 0.364 -0.314 -0.124]
[ 0.154 -0.147 0.951]
[ 0.376 -0.285 -0.114]
[ 0.364 -0.305 -0.123]
[ 0.232 0.402 -0.167]
[ 0.364 0.266 0.058]
[ 0.396 0.104 0.041]
[ 0.418 0.007 -0.069]
[ 0.215 0.368 0.037]
[ 0.072 0.572 0.114]]
ours:
[[ 0.364 -0.314 -0.124]
[ 0.154 -0.147 0.951]
[ 0.376 -0.285 -0.114]
[ 0.364 -0.305 -0.123]
[ 0.232 0.402 -0.167]
[ 0.364 0.266 0.058]
[ 0.396 0.104 0.041]
[ 0.418 0.007 -0.069]
[ 0.215 0.368 0.037]
[ 0.072 0.572 0.114]]