
Groupby + correlation between DataFrame and Series

I have a DataFrame a and a Series b. I want to find the conditional correlation of each column of a with b, conditional on the value of b. Specifically, I'm using pd.cut to break b up into 5 groups. But instead of standard quantiles, I'm using standard deviations of b above or below the mean.

import numpy as np
import pandas as pd

np.random.seed(123)

a = (pd.DataFrame(np.random.randn(1000,3))
     .add_prefix('col'))
b = pd.Series(np.random.randn(1000))

mu, sigma = b.mean(), b.std()
breakpoints = mu + np.array([-2., -1., 1., 2.]) * sigma
breakpoints = np.append(np.insert(breakpoints, 0, -np.inf), np.inf)
# There are now 6 breakpoints to create 5 groupings:
# array([       -inf, -1.91260048, -0.9230609 ,  1.05601827,  2.04555785,
#                inf])

labels = ['[-inf,-2]', '(-2,-1]', '(-1,1]', '(1,2]', '(2,inf]']
groups = pd.cut(b, bins=breakpoints, labels=labels)

All is good through here. I'm hung up on the final line, which uses .corrwith with .groupby and throws a ValueError:

a.groupby(groups).corrwith(b.groupby(groups))

Any ideas? The result of a.corrwith(b) is a Series, so I'm thinking the result here should be a DataFrame with the groups/buckets as columns. For example, one column would be:

print(a[b < breakpoints[1]].corrwith(b[b < breakpoints[1]]))
# Correlation conditional on `b` falling in [-inf, -2 stdev]
col0    0.43708
col1   -0.08440
col2   -0.02923
dtype: float64
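
To spell out the shape of the result I'm after, the single-bucket example above can be repeated for every bucket and collected into one DataFrame. This is only a sketch of the expected output, not a proposed solution; the name `expected` is made up for illustration, and the setup is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
a = pd.DataFrame(np.random.randn(1000, 3)).add_prefix('col')
b = pd.Series(np.random.randn(1000))

# Same bucketing as above: cut b at -2, -1, +1, +2 standard deviations.
mu, sigma = b.mean(), b.std()
inner = mu + np.array([-2., -1., 1., 2.]) * sigma
breakpoints = np.concatenate(([-np.inf], inner, [np.inf]))
labels = ['[-inf,-2]', '(-2,-1]', '(-1,1]', '(1,2]', '(2,inf]']
groups = pd.cut(b, bins=breakpoints, labels=labels)

# One corrwith call per bucket; each call returns a Series indexed by
# the column names of a, so the dict becomes a (columns x buckets) frame.
# `expected` is a hypothetical name used only in this sketch.
expected = pd.DataFrame({
    label: a[groups == label].corrwith(b[groups == label])
    for label in labels
})
print(expected)
```

Each column of `expected` is the correlation of a's columns with b, restricted to the rows where b falls in that bucket.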

One solution that's functional but not pretty:

full = a.join(b.to_frame(name='_drop'))
corrs = (full.groupby(groups)
         .corr()
         .loc[(slice(None), a.columns), '_drop']
         .unstack()
         .T)

print(corrs)
      [-inf,-2]  (-2,-1]   (-1,1]    (1,2]  (2,inf]
col0    0.43708  0.06716  0.02437  0.01695  0.05384
col1   -0.08440  0.04208  0.05529 -0.07146  0.14766
col2   -0.02923 -0.19672  0.01519 -0.02290 -0.17101
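
A possibly tidier alternative to the join-and-unstack approach, offered as a sketch rather than a definitive answer: apply corrwith inside each group and let index alignment restrict b to that group's rows. It relies on the assumption that DataFrame.corrwith aligns a Series argument on the row index before correlating; `observed=True` is passed only to make the handling of categorical groups explicit, and `result` is an illustrative name.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
a = pd.DataFrame(np.random.randn(1000, 3)).add_prefix('col')
b = pd.Series(np.random.randn(1000))

# Same bucketing as in the question.
mu, sigma = b.mean(), b.std()
inner = mu + np.array([-2., -1., 1., 2.]) * sigma
breakpoints = np.concatenate(([-np.inf], inner, [np.inf]))
labels = ['[-inf,-2]', '(-2,-1]', '(-1,1]', '(1,2]', '(2,inf]']
groups = pd.cut(b, bins=breakpoints, labels=labels)

# For each group g (a row subset of a), g.corrwith(b) correlates every
# column of g with the rows of b that share g's index labels.
# The apply result has buckets as rows, so transpose to get them as columns.
result = (a.groupby(groups, observed=True)
           .apply(lambda g: g.corrwith(b))
           .T)
print(result)
```

This avoids building the full correlation matrix per group, since only the correlations against b are computed.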
