pandas dataframe.resample: std on only some columns?

Question

I am trying to calculate monthly standard deviations for my data.

My data is loaded from a database into a dataframe with the following columns:

measurement_time, level, mixing_ratio, concentration

To calculate the monthly standard deviations, I do the following:

    df_std = df_std.set_index('measurement_time')
    df_std = df_std.groupby(['level'], as_index=False)
    df_std = df_std.resample('M').std()

The result is:

>> df_std.head()
                    level  mixing_ratio  concentration
  measurement_time                                    
0 2016-01-31          0.0  3.435376e-11       0.000015
  2016-02-29          0.0  2.692636e-11       0.000012
  2016-03-31          0.0  6.709993e-11       0.000029
  2016-04-30          0.0  3.338249e-11       0.000014
  2016-05-31          0.0  3.916523e-11       0.000017

The problem is that it also calculates the standard deviation of level, while I only want the calculation performed on mixing_ratio and concentration.

The result should be the monthly standard deviations at each level. If I had 7 levels, I would expect my dataframe to have 84 rows (7 * 12 months).

How can I fix this?

Answer 1

When you resample, you have to apply an aggregation to the resampled data.

Example of different functions for different columns:

    df_resample = df_std.resample('M').agg({'level': 'mean', 'mixing_ratio': 'std', 'concentration': 'std'})

Another option is to group by both time and level.
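
A minimal sketch of the group-by-time-and-level option, assuming synthetic data with the question's column names (the values, the 7 levels, and the daily sampling are made up for illustration). It groups on level together with the calendar month of measurement_time, and selects only mixing_ratio and concentration before calling std(), so level itself is never aggregated:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data: 7 levels, one sample per day in 2016.
rng = np.random.default_rng(0)
times = pd.date_range("2016-01-01", "2016-12-31", freq="D")
df = pd.concat(
    [
        pd.DataFrame(
            {
                "measurement_time": times,
                "level": level,
                "mixing_ratio": rng.normal(1e-10, 3e-11, len(times)),
                "concentration": rng.normal(5e-5, 1e-5, len(times)),
            }
        )
        for level in range(7)
    ],
    ignore_index=True,
)

# Group by level and by calendar month, then take the std of the two value columns only.
month = df["measurement_time"].dt.to_period("M")
monthly_std = df.groupby(["level", month])[["mixing_ratio", "concentration"]].std()

print(monthly_std.shape)  # (84, 2): 7 levels * 12 months
```

The result is indexed by (level, month), which gives the one-row-per-level-per-month layout the question asks for. Calling reset_index() on it would turn both index levels back into ordinary columns.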

Answer 2

SamuelNLP's answer is perfect when you need different calculations on different columns of the DataFrame, as in his example!

But I had a problem very similar to pookie's: a large DataFrame (length > 120'000'000) where I needed the std of all but one column over every 5-minute period. In this specific case, there is a simple and faster alternative:

    data_std = data.resample('5min').std()
    data_std = data_std.drop('temperature', axis=1)

I just calculate the std after resampling and then drop the columns that I don't need.

>> data.head()
                             x      y      z  temperature
time
2018-02-21 11:00:06.354  0.606  0.764 -0.499    21.163203
2018-02-21 11:00:06.364 -0.762  0.127 -0.499    21.163203
2018-02-21 11:00:06.374 -0.793  0.143 -0.482    21.163203
2018-02-21 11:00:06.384 -0.809  0.064 -0.418    21.163203
2018-02-21 11:00:06.394 -0.825 -0.016 -0.401    21.163203

>> data_std = data.resample('5min').std()
>> data_std = data_std.drop('temperature', axis=1)
>> data_std.head()
                               x           y           z
time            
2018-02-21 11:00:05     0.260700    0.192227    0.244653
2018-02-21 11:00:10     0.125168    0.164327    0.116562
2018-02-21 11:00:15     0.138330    0.154963    0.126264
2018-02-21 11:00:20     0.182339    0.204350    0.226019
2018-02-21 11:00:25     0.193661    0.107022    0.133125
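
As a variation on the same idea, the unwanted column can also be left out before aggregating, by selecting columns on the resampler. This is a minimal sketch, assuming a synthetic stand-in for data (random values at a made-up 10 ms sampling interval). It only checks that both routes give the same result and makes no claim about which is faster:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `data`: 15 minutes of 10 ms samples (assumed, not the real data).
rng = np.random.default_rng(0)
index = pd.date_range("2018-02-21 11:00:00", periods=90_000, freq="10ms", name="time")
data = pd.DataFrame(
    rng.normal(size=(len(index), 4)),
    index=index,
    columns=["x", "y", "z", "temperature"],
)

# Route 1: std of every column, then drop the one that is not needed.
std_dropped = data.resample("5min").std().drop(columns="temperature")

# Route 2: select the wanted columns on the resampler, then take the std.
std_selected = data.resample("5min")[["x", "y", "z"]].std()

# Both routes should produce the same frame.
pd.testing.assert_frame_equal(std_dropped, std_selected)
```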

Speed-up: I tested the two options on DataFrames of different lengths, running each one 50 times and timing it with a simple time.time() measurement.

    # resample-drop method
    tmp = data.resample('5s').std()
    tmp.drop('temperature', axis=1, inplace=True)

    # aggregate method
    tmp = data.resample('5s').agg({'x': np.std, 'y': np.std, 'z': np.std})

The resample-drop method is 20%-25% faster than the aggregate method.

I assume this runtime difference comes from a more efficient implementation of the plain resample-and-drop path compared with agg. But I would be glad to hear a better explanation.
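
For anyone who wants to repeat the comparison, here is a rough timing sketch. Everything in it is an assumption for illustration: the synthetic data, its size (far smaller than the original DataFrame), the helper names, the 5 repetitions, and the use of time.perf_counter() and the string 'std' in place of time.time() and np.std. The numbers it prints depend on the pandas version and the machine, so it is not expected to reproduce the 20%-25% figure exactly:

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for `data`; the size is arbitrary and much smaller than in the answer.
rng = np.random.default_rng(0)
index = pd.date_range("2018-02-21 11:00:00", periods=200_000, freq="10ms", name="time")
data = pd.DataFrame(
    rng.normal(size=(len(index), 4)),
    index=index,
    columns=["x", "y", "z", "temperature"],
)


def resample_drop(frame):
    """std of every column, then discard the unwanted one."""
    return frame.resample("5s").std().drop(columns="temperature")


def resample_agg(frame):
    """std of the wanted columns only, via agg with a dict."""
    return frame.resample("5s").agg({"x": "std", "y": "std", "z": "std"})


def mean_runtime(func, frame, repeats=5):
    """Average wall-clock seconds per call over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(frame)
    return (time.perf_counter() - start) / repeats


t_drop = mean_runtime(resample_drop, data)
t_agg = mean_runtime(resample_agg, data)
print(f"resample-drop: {t_drop:.4f} s, aggregate: {t_agg:.4f} s")
```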

Answer 1
0 2017-12-07 16:42:53

Answer 2
0 2019-07-09 10:48:17
