PCA explained variance is the same on permutations of data

I am following this tutorial, which compares the explained variance in the top 50 PCs of a dataset to the explained variance in the top 50 PCs of several permutations of that same dataset. It appears they only permute by the columns.

https://towardsdatascience.com/how-to-tune-hyperparameters-of-tsne-7c0596a18868

I tried to replicate this in Python, but I'm getting the exact same explained variance for all permutations. Can someone help me understand why the explained variance of my permuted data is exactly the same for every permutation?

def exp_var_perm_data(data, n_permutations=1):
    """
        data: Assumed to be a pandas dataframe, object that has a .shape attribute
        n_permutations: Integer. Number of permutations to perform
    """    
    df = pd.DataFrame(columns=["Dim%d" % i for i in range(0, data.shape[1])])
    for k in range(0,n_permutations):
        pca_permuted = PCA()
        data_permuted = data.sample(frac=1).reset_index(drop=True)
        pca_permuted.fit(data_permuted)
        df.loc[k] = pca_permuted.explained_variance_ratio_
    return df

from sklearn.decomposition import PCA
from sklearn import datasets
import pandas as pd

iris_data = datasets.load_iris()
iris_data = iris_data.data

exp_var_perm = exp_var_perm_data(pd.DataFrame(iris_data), 10)
print(exp_var_perm)

Output:

       Dim0      Dim1      Dim2      Dim3
0  0.879444  0.093535  0.021659  0.005363
1  0.879444  0.093535  0.021659  0.005363
2  0.879444  0.093535  0.021659  0.005363
3  0.879444  0.093535  0.021659  0.005363
4  0.879444  0.093535  0.021659  0.005363
5  0.879444  0.093535  0.021659  0.005363
6  0.879444  0.093535  0.021659  0.005363
7  0.879444  0.093535  0.021659  0.005363
8  0.879444  0.093535  0.021659  0.005363
9  0.879444  0.093535  0.021659  0.005363

The tutorial permutes each column independently, as far as I can read the R code:

expr_perm <- apply(expr,2,sample)

This seems reasonable, as the goal is to generate data under the null hypothesis of zero covariance.

However, the corresponding code in the question permutes whole rows of the dataframe (all columns together), which merely reorders the samples:

data_permuted = data.sample(frac=1).reset_index(drop=True)
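Reordering whole rows does not change the set of observations, so the sample covariance matrix, and therefore the PCA eigenvalues, are identical for every such permutation. A minimal, self-contained sketch of the difference (the toy data here is my own, not from the question):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(150, 4)))

# Permuting whole rows only reorders the samples: the covariance matrix is unchanged.
rows_permuted = data.sample(frac=1).reset_index(drop=True)
print(np.allclose(np.cov(data.to_numpy().T), np.cov(rows_permuted.to_numpy().T)))  # True

# Permuting each column independently destroys the covariance structure.
cols_permuted = data.apply(np.random.permutation)
print(np.allclose(np.cov(data.to_numpy().T), np.cov(cols_permuted.to_numpy().T)))  # False (almost surely)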

Similar to the R code, we can use apply to permute each column (using a small helper function to do the permutation):

data_permuted = data.apply(permute, axis=0, raw=True)
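As a side note, with NumPy >= 1.20 (an assumption about your environment), Generator.permuted does the same thing in one call: each 1-D slice along the chosen axis is shuffled independently, so axis=0 shuffles every column on its own:

rng = np.random.default_rng()
data_permuted = pd.DataFrame(rng.permuted(data.to_numpy(), axis=0), columns=data.columns)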

Here is the fully working example:

from sklearn.decomposition import PCA
from sklearn import datasets
import pandas as pd
import numpy as np


def exp_var_perm_data(data, n_permutations=1):
    """
        data: Assumed to be a pandas dataframe, object that has a .shape attribute
        n_permutations: Integer. Number of permutations to perform
    """    
    df = pd.DataFrame(columns=["Dim%d" % i for i in range(0, data.shape[1])])
    for k in range(0,n_permutations):
        pca_permuted = PCA()
        data_permuted = data.apply(permute, axis=0, raw=True)
        pca_permuted.fit(data_permuted)
        df.loc[k] = pca_permuted.explained_variance_ratio_
    return df


def permute(x):
    """Create a randomly permuted copy of x"""
    x = x.copy()
    np.random.shuffle(x)
    return x


iris_data = datasets.load_iris()
iris_data = iris_data.data

exp_var_perm = exp_var_perm_data(pd.DataFrame(iris_data), 10)
print(exp_var_perm)
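To close the loop with the tutorial's procedure, the permuted ratios can serve as a null distribution for the observed ones. A sketch of that comparison, reusing the variables from the example above (the p-value summary is my own framing, not from the tutorial):

pca = PCA()
pca.fit(pd.DataFrame(iris_data))
observed = pca.explained_variance_ratio_

# Per component: fraction of permutations whose explained-variance ratio
# is at least as large as the observed one. Small values suggest the
# component captures real covariance structure.
p_values = (exp_var_perm.to_numpy() >= observed).mean(axis=0)
print(p_values)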

I've been following the same tutorial.

I just wrote a function to shuffle the data as indicated in the tutorial:

import random
import numpy as np

def shuffle_by_rows(data):
    """Return a copy of data (a 2-D NumPy array) with each row shuffled independently."""
    shuffled = np.zeros((data.shape[0], data.shape[1]))
    for i, row in enumerate(data):
        shuffled[i] = random.sample(row.tolist(), len(row))
    return shuffled
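For what it's worth, recent NumPy (>= 1.20, an assumption) can do the same row-wise shuffle in one vectorized call:

shuffled = np.random.default_rng().permuted(data, axis=1)  # shuffle within each row independently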

Made a difference for me. Better yet, sklearn and numpy can find the best number of components for you:

pca = PCA()
pca.fit(data)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

This gets the smallest number of components that together account for 95% of the variance in the data.
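Alternatively, scikit-learn can do this selection internally: passing a float between 0 and 1 as n_components makes PCA keep just enough components to reach that fraction of explained variance:

pca = PCA(n_components=0.95)
pca.fit(data)
print(pca.n_components_)  # number of components covering at least 95% of the variance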
