Pandas dataframe.resample std on certain columns only?
I am trying to calculate monthly standard deviations for my data.
My data is loaded from a database into a dataframe with the following columns:
measurement_time, level, mixing_ratio, concentration
To calculate the monthly standard deviations, I do the following:
df_std = df_std.set_index('measurement_time')
df_std = df_std.groupby(['level'], as_index=False)
df_std = df_std.resample('M').std()
The result is:
>> df_std.head()
level mixing_ratio concentration
measurement_time
0 2016-01-31 0.0 3.435376e-11 0.000015
2016-02-29 0.0 2.692636e-11 0.000012
2016-03-31 0.0 6.709993e-11 0.000029
2016-04-30 0.0 3.338249e-11 0.000014
2016-05-31 0.0 3.916523e-11 0.000017
The problem is that it is calculating the standard deviation on level, too, while I only want the calculation performed on mixing_ratio and concentration.
The result should be the monthly standard deviations at each level. If I had 7 levels, I would expect my dataframe to have 84 rows (7 * 12 months).
How can I fix this?
When you resample, you have to do something with the aggregated data.
Example of different functions for different columns:
import numpy as np
# Apply a different aggregation to each column of the monthly bins
df_resample = df_std.resample('M').agg({'level': np.mean, 'mixing_ratio': np.std, 'concentration': np.std})
Another option is to group by both time and level.
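Grouping by time and level could look roughly like the sketch below. It is not taken from the answer: the synthetic data (7 levels, one reading per day for 2016) is made up for illustration, and the 'MS' frequency is an assumption that labels each bin with the month start rather than the month end used in the question. The idea is to combine the level column with pd.Grouper on measurement_time and to select only the two columns that should be aggregated.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data (assumption, not the real data):
# 7 levels, one reading per day for all of 2016.
rng = np.random.default_rng(0)
days = pd.date_range("2016-01-01", "2016-12-31", freq="D")
levels = np.arange(7, dtype=float)
n = len(days) * len(levels)
df = pd.DataFrame({
    "measurement_time": np.tile(days.to_numpy(), len(levels)),
    "level": np.repeat(levels, len(days)),
    "mixing_ratio": rng.normal(1e-9, 5e-11, n),
    "concentration": rng.normal(4e-4, 2e-5, n),
})

# Group by level and by calendar month, then take the std of the
# two measurement columns only; level stays a grouping key.
monthly_std = (
    df.groupby(["level", pd.Grouper(key="measurement_time", freq="MS")])
      [["mixing_ratio", "concentration"]]
      .std()
      .reset_index()
)

print(monthly_std.head())
print(len(monthly_std))  # one row per (level, month) pair
```

Because level is a grouping key here rather than an aggregated column, no standard deviation is computed for it, and the result has one row per level and month.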
SamuelNLP's answer is perfect when different calculations are needed on different columns of the DataFrame, as in his example!
But I had a problem very similar to pookie's: a large DataFrame (length > 120'000'000) where I needed the std of all but one column over every 5 minute period. In this specific case, there is a simple and faster alternative:
data_std = data.resample('5min').std()
# drop() returns a new DataFrame, so the result has to be assigned back
data_std = data_std.drop('temperature', axis=1)
I simply calculate the std after resampling and then drop the column that I don't need.
>> data.head()
x y z temperature
time
2018-02-21 11:00:06.354 0.606 0.764 -0.499 21.163203
2018-02-21 11:00:06.364 -0.762 0.127 -0.499 21.163203
2018-02-21 11:00:06.374 -0.793 0.143 -0.482 21.163203
2018-02-21 11:00:06.384 -0.809 0.064 -0.418 21.163203
2018-02-21 11:00:06.394 -0.825 -0.016 -0.401 21.163203
>> data_std = data.resample('5min').std()
>> data_std = data_std.drop('temperature', axis=1)
>> data_std.head()
x y z
time
2018-02-21 11:00:05 0.260700 0.192227 0.244653
2018-02-21 11:00:10 0.125168 0.164327 0.116562
2018-02-21 11:00:15 0.138330 0.154963 0.126264
2018-02-21 11:00:20 0.182339 0.204350 0.226019
2018-02-21 11:00:25 0.193661 0.107022 0.133125
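A related variant, not part of the answer above, is to select the wanted columns on the resampler before calling std(), so the unwanted column is never aggregated. The sketch below uses made-up random data shaped like the sample above and a 5-second bin width, both of which are assumptions for illustration. It only shows that both routes give the same numbers and makes no claim about which is faster.

```python
import numpy as np
import pandas as pd

# Made-up data shaped like the sample above (assumption): 60 s of 10 ms readings.
rng = np.random.default_rng(0)
idx = pd.date_range("2018-02-21 11:00:00", periods=6000, freq="10ms")
data = pd.DataFrame(
    {
        "x": rng.normal(size=len(idx)),
        "y": rng.normal(size=len(idx)),
        "z": rng.normal(size=len(idx)),
        "temperature": 21.163203,
    },
    index=idx,
)
data.index.name = "time"

# Route 1: aggregate everything, then drop the unwanted column.
dropped = data.resample("5s").std().drop(columns="temperature")

# Route 2: pick the wanted columns on the resampler, then aggregate.
selected = data.resample("5s")[["x", "y", "z"]].std()

print(np.allclose(dropped.to_numpy(), selected.to_numpy()))
```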
Speed-up: I tested the two options over different DataFrame lengths, running each 50 times and timing it with a simple time.time() measurement.
# resample-drop method
tmp = data.resample('5s').std()
tmp.drop('temperature', axis=1, inplace=True)
# aggregate method
tmp = data.resample('5s').agg({'x': np.std, 'y': np.std, 'z': np.std})
The resample-drop method is 20%-25% faster than the aggregate method.
I assume this runtime difference is due to a more efficient implementation of the simple resample-drop path compared to the aggregate path. But I would be glad to hear a better explanation.
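For anyone who wants to repeat the comparison, a minimal timing sketch with the standard-library timeit module could look like this. The data is random and much smaller than the DataFrame described above, the helper names resample_drop and aggregate are hypothetical, and the aggregation is passed as the string 'std' instead of np.std, so the numbers it prints are not expected to match the 20%-25% figure.

```python
import timeit

import numpy as np
import pandas as pd

# Random stand-in data (assumption): 100,000 readings at 10 ms spacing.
rng = np.random.default_rng(0)
idx = pd.date_range("2018-02-21 11:00:00", periods=100_000, freq="10ms")
data = pd.DataFrame(
    rng.normal(size=(len(idx), 4)),
    index=idx,
    columns=["x", "y", "z", "temperature"],
)


def resample_drop(df):
    """Aggregate every column, then drop the one that is not needed."""
    return df.resample("5s").std().drop(columns="temperature")


def aggregate(df):
    """Aggregate only the named columns through agg()."""
    return df.resample("5s").agg({"x": "std", "y": "std", "z": "std"})


# Best of 5 repeats, 5 calls per repeat, to dampen noise from other processes.
t_drop = min(timeit.repeat(lambda: resample_drop(data), number=5, repeat=5))
t_agg = min(timeit.repeat(lambda: aggregate(data), number=5, repeat=5))

print(f"resample-drop: {t_drop:.4f} s for 5 calls")
print(f"aggregate:     {t_agg:.4f} s for 5 calls")
```

Taking the minimum over several repeats is a common way to reduce the influence of background load. Results will still vary with the pandas version, the data size and the bin width.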