Pandas dataframe.resample std on certain columns only?

I am trying to calculate monthly standard deviations for my data.

My data is loaded from a database into a dataframe with the following columns:

measurement_time, level, mixing_ratio, concentration

To calculate the monthly standard deviations, I do the following:

    df_std = df_std.set_index('measurement_time')
    df_std = df_std.groupby(['level'], as_index=False)
    df_std = df_std.resample('M').std()

The result is:

>> df_std.head()
                    level  mixing_ratio  concentration
  measurement_time                                    
0 2016-01-31          0.0  3.435376e-11       0.000015
  2016-02-29          0.0  2.692636e-11       0.000012
  2016-03-31          0.0  6.709993e-11       0.000029
  2016-04-30          0.0  3.338249e-11       0.000014
  2016-05-31          0.0  3.916523e-11       0.000017

The problem is that it calculates the standard deviation of level too, while I only want the calculation performed on mixing_ratio and concentration.

The result should be the monthly standard deviations at each level. If I had 7 levels, I would expect my dataframe to have 84 rows (7 * 12 months).

How can I fix this?

When you resample, you have to apply an aggregation to the resampled data.

Example of different functions for different columns:

import numpy as np
df_resample = df_std.resample('M').agg({'level': np.mean, 'mixing_ratio': np.std, 'concentration': np.std})

Another option is to group by both time and level.
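The group-by-time-and-level option is not spelled out in this answer, so here is a minimal sketch of one way it might look. The column names come from the question; the synthetic data, the variable names `df` and `monthly_std`, and the use of `pd.Grouper` with a `MonthEnd` offset are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's table: 2 levels, daily data for Jan-Mar 2016.
rng = np.random.default_rng(0)
times = pd.date_range("2016-01-01", "2016-03-31", freq="D")
frames = [
    pd.DataFrame({
        "measurement_time": times,
        "level": lvl,
        "mixing_ratio": rng.normal(size=len(times)),
        "concentration": rng.normal(size=len(times)),
    })
    for lvl in (0, 1)
]
df = pd.concat(frames, ignore_index=True)

# Group by level and by calendar month, then take the std of the two
# measurement columns only. level is a grouping key, so it is not aggregated.
monthly_std = (
    df.groupby(["level", pd.Grouper(key="measurement_time", freq=pd.offsets.MonthEnd())])
      [["mixing_ratio", "concentration"]]
      .std()
)

print(monthly_std)  # one row per (level, month): 2 levels * 3 months = 6 rows
```

With 7 levels and 12 months of data this layout would give the 84 rows the question asks for, indexed by level and month end.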

SamuelNLP's answer is perfect when different calculations are needed for different columns of the DataFrame, as in his example!

But I had a problem very similar to pookie's: a large DataFrame (more than 120,000,000 rows) where I needed the std of all but one column over every 5-minute period. In this specific case, there is a simple and faster alternative:

data_std = data.resample('5min').std()
# drop returns a new DataFrame, so the result has to be assigned
data_std = data_std.drop('temperature', axis=1)

I just calculated the std after the resampling, and then dropped the columns that I didn't need.

>> data.head()
                                 x      y        z  temperature
time                
2018-02-21 11:00:06.354      0.606  0.764   -0.499  21.163203
2018-02-21 11:00:06.364     -0.762  0.127   -0.499  21.163203
2018-02-21 11:00:06.374     -0.793  0.143   -0.482  21.163203
2018-02-21 11:00:06.384     -0.809  0.064   -0.418  21.163203
2018-02-21 11:00:06.394     -0.825  -0.016  -0.401  21.163203

>> data_std = data.resample('5min').std()
>> data_std = data_std.drop('temperature', axis=1)
>> data_std.head()
                               x           y           z
time            
2018-02-21 11:00:05     0.260700    0.192227    0.244653
2018-02-21 11:00:10     0.125168    0.164327    0.116562
2018-02-21 11:00:15     0.138330    0.154963    0.126264
2018-02-21 11:00:20     0.182339    0.204350    0.226019
2018-02-21 11:00:25     0.193661    0.107022    0.133125
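A related variant, not part of the original answer, is to select the wanted columns on the resampler before calling std, so that nothing is computed for the unwanted column at all. The sketch below uses synthetic data with the same column names as above; the names `selected_std` and `dropped_std` are made up for the comparison, and no claim is made here about which variant is faster.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the data shown above: 30 seconds sampled every 10 ms.
rng = np.random.default_rng(0)
index = pd.date_range("2018-02-21 11:00:00", periods=3000, freq="10ms")
data = pd.DataFrame(
    rng.normal(size=(3000, 4)),
    index=index,
    columns=["x", "y", "z", "temperature"],
)

# Select columns on the resampler itself, then aggregate.
selected_std = data.resample("5s")[["x", "y", "z"]].std()

# The resample-then-drop approach from this answer, for comparison.
dropped_std = data.resample("5s").std().drop("temperature", axis=1)

print(selected_std.head())
```

Both frames should hold the same values; the difference is only in whether temperature is aggregated and then thrown away.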

Speed-up: I tested the two options over different DataFrame lengths, running each 50 times with a simple time.time() measurement.

# resample-drop method
tmp = data.resample('5s').std()
tmp.drop('temperature', axis=1, inplace=True)

# aggregate method
tmp = data.resample('5s').agg({'x': np.std, 'y':np.std, 'z':np.std})

The resample-drop method is 20%-25% faster than the aggregate method.

[Figure: runtime of the two methods, each rerun 50 times]

I assume this runtime difference is due to a more efficient implementation of the simple resample-drop path compared to the aggregate path, but I would welcome a better explanation.
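For anyone who wants to repeat the comparison, a rough timing harness might look like the sketch below. It is not the original benchmark script: the data is synthetic, the helper `time_call` is hypothetical, `time.perf_counter()` is swapped in for `time.time()`, and string aggregation names are used instead of `np.std` so the sketch does not depend on how a given pandas version treats NumPy callables. Absolute numbers will vary with the machine, the pandas version and the DataFrame length.

```python
import time

import numpy as np
import pandas as pd


def time_call(func, repeats=50):
    """Return the mean wall-clock time of func() over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats


# Synthetic data: 100,000 rows sampled every 10 ms.
rng = np.random.default_rng(0)
index = pd.date_range("2018-02-21 11:00:00", periods=100_000, freq="10ms")
data = pd.DataFrame(
    rng.normal(size=(100_000, 4)),
    index=index,
    columns=["x", "y", "z", "temperature"],
)


def resample_drop():
    # std of every column, then discard the unwanted one
    return data.resample("5s").std().drop("temperature", axis=1)


def aggregate():
    # std of the wanted columns only, via a per-column mapping
    return data.resample("5s").agg({"x": "std", "y": "std", "z": "std"})


t_drop = time_call(resample_drop)
t_agg = time_call(aggregate)
print(f"resample-drop: {t_drop:.4f} s per run, aggregate: {t_agg:.4f} s per run")
```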

