Create Correlation Matrix by rows: Pandas
I want to create a correlation matrix by rows. Here's what my df looks like:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'],index = ["doc1", "doc2", "doc3"])
#Output
      a  b  c
doc1  1  2  3
doc2  4  5  6
doc3  7  8  9
I want to find the correlation between documents. I used
corrMatrix = df.corr()
but it gives me the correlation between each cell (I think). The other approach I have considered is to simply subset each document and then use
np.corrcoef(doc1,doc2)
and manually create a 2D numpy array. Any ideas on how I can do this elegantly?
DataFrame.corr() finds the correlation between pairs of columns. If you want rows, transpose first. (I modified your data slightly so that not everything is perfectly correlated.)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 8], [4, 5, 6], [5, 8, 9]]),
columns=['a', 'b', 'c'], index=["doc1", "doc2", "doc3"])
df.T.corr()
          doc1      doc2      doc3
doc1  1.000000  0.924473  0.782467
doc2  0.924473  1.000000  0.960769
doc3  0.782467  0.960769  1.000000
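To make the column-versus-row distinction concrete, here is a small sketch that computes both orientations on the same data and inspects the labels of each result. The variable names `col_corr` and `row_corr` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 8], [4, 5, 6], [5, 8, 9]]),
    columns=["a", "b", "c"],
    index=["doc1", "doc2", "doc3"],
)

# corr() on the original frame compares columns, so the result is labelled a, b, c
col_corr = df.corr()

# corr() on the transposed frame compares the original rows, so it is labelled doc1..doc3
row_corr = df.T.corr()

print(list(col_corr.index))
print(list(row_corr.index))
```

Both results are 3x3 here only because the example has three rows and three columns; in general the row-wise matrix has one row and one column per document.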
Or use np.corrcoef on the non-transposed DataFrame. By default (rowvar=True) it treats each row as a variable, so no transpose is needed. With a large DataFrame this will be a lot faster than the above, since you avoid the unnecessary transpose.
np.corrcoef(df)
array([[1.        , 0.92447345, 0.78246663],
       [0.92447345, 1.        , 0.96076892],
       [0.78246663, 0.96076892, 1.        ]])
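np.corrcoef returns a plain array, so the document labels are lost. If you want them back, one possible approach (a sketch, not the only way) is to wrap the result in a new DataFrame that reuses the original index for both axes. The names `corr_values` and `corr_df` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 8], [4, 5, 6], [5, 8, 9]]),
    columns=["a", "b", "c"],
    index=["doc1", "doc2", "doc3"],
)

# np.corrcoef treats each row as a variable by default (rowvar=True)
corr_values = np.corrcoef(df.to_numpy())

# Reattach the row labels on both axes of the square result
corr_df = pd.DataFrame(corr_values, index=df.index, columns=df.index)
print(corr_df)
```

On this small example the values agree with df.T.corr(). Keep in mind that DataFrame.corr() skips missing values pairwise while np.corrcoef does not, so the two can differ when the data contains NaN.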