Correlation between a pandas Series and a whole DataFrame

Question

I have a Series of values, and I want to compute its Pearson correlation with each row of a given table.

How can I do that?

Example:

import pandas as pd

v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]

s = pd.Series(v)
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

# Here I expected to do df.corrwith(s), but that doesn't work

Computed with Series.corr(), the expected output is

-0.1666666666666666  # correlation with the first row
0.83914639167827343  # correlation with the second row
-0.35355339059327379 # correlation with the third row

Answer 1

The Series needs the same index as the DataFrame's columns so that the two align. Then pass axis=1 to corrwith to compute the correlation row by row:

s1 = pd.Series(s.values, index=df.columns)
print (s1)
a    -1
b     5
c     0
d     0
e    10
f     0
g    -7
dtype: int64

print (df.corrwith(s1, axis=1))
0   -0.166667
1    0.839146
2   -0.353553
dtype: float64

print (df.corrwith(pd.Series(v, index=df.columns), axis=1))
0   -0.166667
1    0.839146
2   -0.353553
dtype: float64
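To illustrate why the re-indexing step matters, here is a minimal sketch (not part of the original answer) that checks which labels the original Series and the DataFrame columns have in common, and then recomputes the row correlations one row at a time with Series.corr(). The names shared and per_row are hypothetical and used only for illustration.

```python
import pandas as pd

v = [-1, 5, 0, 0, 10, 0, -7]
df = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 1, 0, 0],
     [1, 1, 0, 0, 0, 0, 1]],
    columns=list("abcdefg"),
)

# A plain pd.Series(v) is labelled 0..6, while the columns are 'a'..'g',
# so the two share no labels and there is nothing to align on.
s = pd.Series(v)
shared = s.index.intersection(df.columns)
print(len(shared))

# Once the Series carries the column labels, each row can be correlated
# with it directly; this should agree with df.corrwith(s1, axis=1).
s1 = pd.Series(v, index=df.columns)
per_row = df.apply(lambda row: row.corr(s1), axis=1)
print(per_row)
```

Going through apply and Series.corr() is slower than corrwith, so this is meant as a cross-check rather than a replacement.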

Edit:

You can also specify columns and work with a subset:

cols = ['a','b','e']

print (df[cols])
   a  b  e
0  1  0  0
1  0  1  1
2  1  1  0

print (df[cols].corrwith(pd.Series(v, index=df.columns), axis=1))
0   -0.891042
1    0.891042
2   -0.838628
dtype: float64

Answer 2

This may be useful for those who care about performance. I found that this runs in about half the time of pandas' corrwith.

Your data:

import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

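Solution (note that v is not converted to a Series):

from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path
# pearsonr returns (correlation, p-value); keep only the correlation for each row
s_corrs = df.apply(lambda x: pearsonr(x.values, v)[0], axis=1)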
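If the per-row apply is still too slow, the same Pearson formula can be written with NumPy array operations. The following is a rough sketch under that assumption, not something from the original answer, and it has not been benchmarked here. The helper name rowwise_pearson is hypothetical.

```python
import numpy as np
import pandas as pd

v = [-1, 5, 0, 0, 10, 0, -7]
df = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 1, 0, 0],
     [1, 1, 0, 0, 0, 0, 1]],
    columns=list("abcdefg"),
)

def rowwise_pearson(frame, values):
    """Pearson correlation of every row of `frame` with `values` (hypothetical helper)."""
    x = frame.to_numpy(dtype=float)
    y = np.asarray(values, dtype=float)
    # Centre each row and the reference vector on their own means.
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    # Numerator: sum of products of deviations, one value per row.
    num = xc @ yc
    # Denominator: product of the root sums of squared deviations.
    den = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return pd.Series(num / den, index=frame.index)

row_corrs = rowwise_pearson(df, v)
print(row_corrs)
```

This sketch matches values to columns by position, not by label, and a constant row makes the denominator zero, which yields NaN for that row.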

Answer 1
5 accepted 2017-01-23 12:46:22

Answer 2
0 2019-02-12 17:35:47
