The main idea of Principal Component Analysis (PCA) is to map n-dimensional features onto k dimensions. These k dimensions are brand-new orthogonal features, also called principal components, reconstructed from the original n-dimensional features. PCA works by sequentially finding a set of mutually orthogonal axes in the original space: the first new axis is the direction of greatest variance in the original data; the second is the direction of greatest variance within the subspace orthogonal to the first axis; the third is the direction of greatest variance orthogonal to the first two; and so on, until n such axes are obtained. With axes chosen this way, most of the variance is concentrated in the first k axes, while the remaining axes carry variance close to 0. We can therefore keep only the first k dimensions and achieve dimensionality reduction of the data.
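Before digging into the source, here is a minimal sketch of that idea (the toy dataset below is made up purely for illustration): project 3-D data onto its top two principal components and check how much variance survives.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 200 samples in 3 dimensions; most variance lies along the first two axes
X = rng.randn(200, 3) * np.array([5.0, 2.0, 0.1])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # the two kept axes carry nearly all variance
```

With the intuition in place, we can walk through scikit-learn's PCA implementation. First, the transform() method: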
```python
def transform(self, X):
    ...  # validation of the estimator and the input data, omitted here
    if self.mean_ is not None:
        X = X - self.mean_
    X_transformed = np.dot(X, self.components_.T)
    # whitening
    if self.whiten:
        X_transformed /= np.sqrt(self.explained_variance_)
    return X_transformed
```
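A quick way to see what the whitening branch does (again with synthetic data, just for illustration): dividing by the square root of the explained variance leaves every projected component with unit variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(300, 5) * np.arange(1, 6)  # columns with very different scales

X_w = PCA(n_components=3, whiten=True).fit(X).transform(X)
print(X_w.var(axis=0, ddof=1))           # each column is ~1 after whitening
```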
The implementation of the fit_transform() method:
The relevant formulas have been thoughtfully written into the comments by the authors.
```python
def fit_transform(self, X, y=None):
    U, S, Vt = self._fit(X)
    U = U[:, : self.n_components_]

    if self.whiten:
        # X_new = X * V / S * sqrt(n_samples) = U * sqrt(n_samples)
        U *= sqrt(X.shape[0] - 1)
    else:
        # X_new = X * V = U * S * Vt * V = U * S
        U *= S[: self.n_components_]

    return U
```
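Both branches produce the projection directly from the SVD factors, saving the extra matrix product that transform() would otherwise perform. A quick consistency check (synthetic data, illustration only) confirms the two paths agree:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 4)

a = PCA(n_components=2).fit_transform(X)
b = PCA(n_components=2).fit(X).transform(X)
print(np.allclose(a, b))  # True; svd_flip fixes the sign ambiguity of the SVD
```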
```python
def _assess_dimension(spectrum, rank, n_samples):
    """Compute the log-likelihood of a rank ``rank`` dataset.

    The dataset is assumed to be embedded in gaussian noise of shape
    (n, dimf) having spectrum ``spectrum``.

    Parameters
    ----------
    spectrum : ndarray of shape (n_features,)
        Data spectrum.
    rank : int
        Tested rank value. It should be strictly lower than n_features,
        otherwise the method isn't specified (division by zero in equation
        (31) from the paper).
    n_samples : int
        Number of samples.

    Returns
    -------
    ll : float
        The log-likelihood.
    """
    n_features = spectrum.shape[0]
    if not 1 <= rank < n_features:
        raise ValueError("the tested rank should be in [1, n_features - 1]")

    eps = 1e-15

    if spectrum[rank - 1] < eps:
        # When the tested rank is associated with a small eigenvalue, there's
        # no point in computing the log-likelihood: it's going to be very
        # small and won't be the max anyway. Also, it can lead to numerical
        # issues below when computing pa, in particular in
        # log((spectrum[i] - spectrum[j])) because this will take the log of
        # something very small.
        return -np.inf

    pu = -rank * log(2.0)
    for i in range(1, rank + 1):
        pu += (
            gammaln((n_features - i + 1) / 2.0)
            - log(np.pi) * (n_features - i + 1) / 2.0
        )
    # ... (the remaining terms of the log-likelihood follow in the full source;
    # only the beginning is shown in this excerpt)
```
```python
def _infer_dimension(spectrum, n_samples):
    """Infers the dimension of a dataset with a given spectrum.

    The returned value will be in [1, n_features - 1].
    """
    ll = np.empty_like(spectrum)
    ll[0] = -np.inf  # we don't want to return n_components = 0
    for rank in range(1, spectrum.shape[0]):
        ll[rank] = _assess_dimension(spectrum, rank, n_samples)
    return ll.argmax()
```
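`_infer_dimension` simply evaluates `_assess_dimension` at every candidate rank and returns the argmax; this is the machinery behind `n_components='mle'`, which picks the dimensionality automatically (Minka's MLE). A small demonstration on synthetic data with a known 3-dimensional signal (expected, though not guaranteed, to recover 3):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# three strong latent directions embedded in 10-D, plus weak isotropic noise
X = rng.randn(500, 3) @ (5.0 * rng.randn(3, 10)) + 0.1 * rng.randn(500, 10)

pca = PCA(n_components="mle").fit(X)
print(pca.n_components_)  # should come out as 3 for a clear signal like this
```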
```python
def get_covariance(self):
    """Compute data covariance with the generative model.

    ``cov = components_.T * S**2 * components_ + sigma2 * eye(n_features)``
    where S**2 contains the explained variances, and sigma2 contains the
    noise variances.

    Returns
    -------
    cov : array of shape=(n_features, n_features)
        Estimated covariance of data.
    """
    components_ = self.components_
    exp_var = self.explained_variance_
    if self.whiten:
        components_ = components_ * np.sqrt(exp_var[:, np.newaxis])
    exp_var_diff = np.maximum(exp_var - self.noise_variance_, 0.0)
    cov = np.dot(components_.T * exp_var_diff, components_)
    cov.flat[:: len(cov) + 1] += self.noise_variance_  # modify diag inplace
    return cov
```
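So `get_covariance` rebuilds an estimate of the feature covariance from only the retained components plus an isotropic noise floor on the diagonal. When the discarded spectrum really is near-isotropic, this low-rank reconstruction stays close to the empirical covariance (synthetic sanity check, illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 6)

pca = PCA(n_components=4).fit(X)
cov = pca.get_covariance()
emp = np.cov(X, rowvar=False)
print(np.abs(cov - emp).max())  # small here, since the dropped directions are noise-like
```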
```python
def get_precision(self):
    """Compute data precision matrix with the generative model.

    Equals the inverse of the covariance but computed with the matrix
    inversion lemma for efficiency.

    Returns
    -------
    precision : array, shape=(n_features, n_features)
        Estimated precision of data.
    """
    n_features = self.components_.shape[1]

    # handle corner cases first
    if self.n_components_ == 0:
        return np.eye(n_features) / self.noise_variance_

    if np.isclose(self.noise_variance_, 0.0, atol=0.0):
        return linalg.inv(self.get_covariance())
```
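The corner cases above short-circuit; the general path (omitted from this excerpt) applies the Woodbury matrix-inversion lemma so that only an n_components x n_components system is inverted instead of the full n_features x n_features covariance. Either way, the result should match a brute-force inverse (synthetic check, illustration only):

```python
import numpy as np
from scipy import linalg
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(400, 8)

pca = PCA(n_components=3).fit(X)
precision = pca.get_precision()            # fast path via the inversion lemma
direct = linalg.inv(pca.get_covariance())  # brute-force inverse
print(np.allclose(precision, direct))      # True up to numerical error
```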