
Create Correlation Matrix by rows: Pandas

I want to create a correlation matrix by rows. Here's what my df looks like:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'], index=["doc1", "doc2", "doc3"])

#Output
      a  b  c
doc1  1  2  3
doc2  4  5  6
doc3  7  8  9

I want to find the correlation between documents. I used

corrMatrix = df.corr()

but it gives me the correlation between each cell (I think). The other approach I have considered is to simply subset each of the documents and then use

np.corrcoef(doc1,doc2)

and manually create a 2D numpy array. Any ideas on how I can do this elegantly?

DataFrame.corr() finds the correlation between pairs of columns. If you want rows, transpose first. (I modified your data slightly so everything isn't perfectly correlated.)

import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 8], [4, 5, 6], [5, 8, 9]]),
                  columns=['a', 'b', 'c'], index=["doc1", "doc2", "doc3"])

df.T.corr()

          doc1      doc2      doc3
doc1  1.000000  0.924473  0.782467
doc2  0.924473  1.000000  0.960769
doc3  0.782467  0.960769  1.000000
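To make the column-versus-row distinction concrete, here is a minimal sketch that compares the labels of the two results and checks one entry against the pairwise np.corrcoef approach mentioned in the question. The names col_corr, row_corr and manual are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 8], [4, 5, 6], [5, 8, 9]]),
                  columns=['a', 'b', 'c'], index=["doc1", "doc2", "doc3"])

# corr() correlates columns, so the result is labeled a, b, c
col_corr = df.corr()

# Transposing first turns the documents into columns,
# so the result is labeled doc1, doc2, doc3
row_corr = df.T.corr()

# Pairwise check for a single pair of documents
manual = np.corrcoef(df.loc["doc1"], df.loc["doc2"])[0, 1]

print(list(col_corr.columns))
print(list(row_corr.columns))
print(np.isclose(manual, row_corr.loc["doc1", "doc2"]))
```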

Or use np.corrcoef on the non-transposed DataFrame. np.corrcoef treats each row as a variable by default (rowvar=True), so no transpose is needed. This will be a lot faster than the above with a large DataFrame since you avoid the unnecessary transpose.

np.corrcoef(df)

array([[1.        , 0.92447345, 0.78246663],
       [0.92447345, 1.        , 0.96076892],
       [0.78246663, 0.96076892, 1.        ]])


 