How to test correlation between two sets in Python?
I have two different dataframes. One of them is shown below:
df1 =
Datetime BSL
0 7 127.504505
1 8 115.254132
2 9 108.994275
3 10 102.936860
4 11 99.830400
5 12 114.660522
6 13 138.215339
7 14 132.131075
8 15 121.478006
9 16 113.795645
10 17 114.038462
The other one is df2 =
Datetime Number of Accident
0 7 3455
1 8 17388
2 9 27767
3 10 33622
4 11 33474
5 12 12670
6 13 28137
7 14 27141
8 15 26515
9 16 24849
10 17 13013
The first one is the blood sugar level (BSL) of people by time of day (7 means between 7 am and 8 am). The second one is the number of accidents during the same hours.
When I try this code:
df1.corr(df2, "pearson")
I get this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I solve it? Or, how can I test the correlation between two different variables?
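from scipy.stats import pearsonr
# Join the two frames on the shared 'Datetime' column
df_full = df1.merge(df2,how='left',on='Datetime')
# The accident column in df2 is named 'Number of Accident'
full_correlation = pearsonr(df_full['BSL'],df_full['Number of Accident'])
print('Correlation coefficient:',full_correlation[0])
print('P-value:',full_correlation[1])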
Output:
(-0.2934597230564072, 0.3811116115819819)
Correlation coefficient: -0.2934597230564072
P-value: 0.3811116115819819
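As a self-contained sketch of the merge-then-pearsonr approach, the snippet below rebuilds both frames from the values posted in the question. It assumes the accident column keeps the name 'Number of Accident' shown in df2, and it uses an inner join on 'Datetime', which is a choice made here rather than something the answer prescribes.

```python
import pandas as pd
from scipy.stats import pearsonr

hours = list(range(7, 18))
df1 = pd.DataFrame({
    "Datetime": hours,
    "BSL": [127.504505, 115.254132, 108.994275, 102.936860, 99.830400,
            114.660522, 138.215339, 132.131075, 121.478006, 113.795645,
            114.038462],
})
df2 = pd.DataFrame({
    "Datetime": hours,
    "Number of Accident": [3455, 17388, 27767, 33622, 33474, 12670,
                           28137, 27141, 26515, 24849, 13013],
})

# Pair each BSL value with the accident count of the same hour
df_full = df1.merge(df2, on="Datetime", how="inner")

# pearsonr returns the coefficient and the two-sided p-value
r, p_value = pearsonr(df_full["BSL"], df_full["Number of Accident"])
print("Correlation coefficient:", r)
print("P-value:", p_value)
```

With only 11 paired observations, the p-value is worth reading alongside the coefficient before drawing conclusions.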
You want an hourly correlation, but that is mathematically impossible because you have only one (x, y) pair for each hour. Therefore the output will be full of NaNs. This is the code, but its output is not meaningful:
df_corr = (df_full.groupby('Datetime')[['BSL','Number of Accident']].corr()
    .drop(columns='BSL').drop('Number of Accident',level=1)
    .rename(columns={'Number of Accident':'Correlation'}))
print(df_corr)
Output:
Correlation
Datetime
7 BSL NaN
8 BSL NaN
9 BSL NaN
10 BSL NaN
11 BSL NaN
12 BSL NaN
13 BSL NaN
14 BSL NaN
15 BSL NaN
16 BSL NaN
17 BSL NaN
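A per-hour correlation only becomes defined once each hour has several paired observations. The sketch below illustrates this with made-up raw data (the `raw` frame and all of its values are hypothetical and not taken from the question), with three observations for each of two hours.

```python
import pandas as pd

# Hypothetical raw measurements: three paired observations per hour.
# These numbers are invented purely for illustration.
raw = pd.DataFrame({
    "Datetime": [7, 7, 7, 8, 8, 8],
    "BSL": [120.0, 125.0, 131.0, 110.0, 118.0, 116.0],
    "Number of Accident": [10, 12, 15, 30, 28, 33],
})

# One Pearson coefficient per hour, computed within each group
per_hour = (
    raw.groupby("Datetime")[["BSL", "Number of Accident"]]
       .apply(lambda g: g["BSL"].corr(g["Number of Accident"]))
)
print(per_hour)
```

With the aggregated frames from the question there is nothing to group this way, so a single overall coefficient is the only one that can be computed.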
Since your dataframes have multiple columns, you need to specify the names of the columns to use:
df1['BSL'].corr(df2['Number of Accident'], "pearson")
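One detail worth knowing about this approach: `Series.corr` pairs values by index label, not by row position. A minimal sketch, assuming the frames from the question, sets 'Datetime' as the index so the pairing is by hour. The shuffled copy `df2_shuffled` is introduced here only to show that row order then stops mattering.

```python
import pandas as pd

hours = list(range(7, 18))
df1 = pd.DataFrame({
    "Datetime": hours,
    "BSL": [127.504505, 115.254132, 108.994275, 102.936860, 99.830400,
            114.660522, 138.215339, 132.131075, 121.478006, 113.795645,
            114.038462],
})
df2 = pd.DataFrame({
    "Datetime": hours,
    "Number of Accident": [3455, 17388, 27767, 33622, 33474, 12670,
                           28137, 27141, 26515, 24849, 13013],
})

# Index both series by hour so corr() aligns on 'Datetime'
bsl = df1.set_index("Datetime")["BSL"]
accidents = df2.set_index("Datetime")["Number of Accident"]
r_aligned = bsl.corr(accidents, method="pearson")

# Shuffling the rows of df2 does not change the result, because the
# values are still matched by their 'Datetime' label
df2_shuffled = df2.sample(frac=1, random_state=0)
r_shuffled = bsl.corr(df2_shuffled.set_index("Datetime")["Number of Accident"])

print(r_aligned, r_shuffled)
```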
The corr() method of a pandas dataframe calculates a correlation matrix for all columns in one dataframe. You have two dataframes, so that method won't work. You can solve this by doing:
df1['number'] = df2['Number of Accident']
df1.corr("pearson")
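A variation on the same idea, sketched under the assumption that both frames share the 'Datetime' column from the question: merge the frames on 'Datetime' instead of assigning by row position, leave 'Datetime' out of the matrix, and read the single coefficient of interest out of the result.

```python
import pandas as pd

hours = list(range(7, 18))
df1 = pd.DataFrame({
    "Datetime": hours,
    "BSL": [127.504505, 115.254132, 108.994275, 102.936860, 99.830400,
            114.660522, 138.215339, 132.131075, 121.478006, 113.795645,
            114.038462],
})
df2 = pd.DataFrame({
    "Datetime": hours,
    "Number of Accident": [3455, 17388, 27767, 33622, 33474, 12670,
                           28137, 27141, 26515, 24849, 13013],
})

# Put both variables in one frame, matched by hour
combined = df1.merge(df2, on="Datetime")

# Correlation matrix of the two variables only (the hour itself is excluded)
corr_matrix = combined[["BSL", "Number of Accident"]].corr(method="pearson")
print(corr_matrix)

# The off-diagonal entry is the BSL/accident correlation
r = corr_matrix.loc["BSL", "Number of Accident"]
print(r)
```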