
Most efficient way to use resample on groupby with start and end datetime while preserving certain columns - and calculate statistics after that

I work with DataFrames that are huge in terms of shape; my example here is only a boiled-down one.

Let's assume the following scenario:

# we have these two datetime objects as start and end for my data set
first_day = pd.Timestamp("2020-03-01 00:00:00")
last_day = pd.Timestamp("2020-03-31 23:59:59")

# assume we have a big DataFrame df like this with many, many rows:
              datetime   var1   var2  count1  count2
1  2020-03-01 00:00:01    "A"    "B"       1      12
2  2020-03-01 00:00:01    "C"    "C"       2     179
3  2020-03-01 00:00:01    "C"    "D"       1      72
4  2020-03-01 00:00:02    "C"    "E"       4       7
5  2020-03-01 00:00:02    "D"    "E"       2      47
6  2020-03-01 00:00:02    "H"    "F"       1      31
7  2020-03-01 00:00:02  "ABC"  "DEF"       7      74
8  2020-03-01 00:00:03  "ABC"  "DEF"       3      10
...

# I now want to group on this DataFrame like this:
gb = df.groupby(["var1", "var2"])

# which yields groups like this, for example:
              datetime   var1   var2  count1  count2
7  2020-03-01 00:00:02  "ABC"  "DEF"       7      74
8  2020-03-01 00:00:03  "ABC"  "DEF"       3      10

What I need to do now is resample every group between the given first_day and last_day with an offset alias of 1S, so that I get something like this for each group:

              datetime   var1   var2  count1  count2
0  2020-03-01 00:00:00  "ABC"  "DEF"       0       0
1  2020-03-01 00:00:01  "ABC"  "DEF"       0       0
2  2020-03-01 00:00:02  "ABC"  "DEF"       7      74
3  2020-03-01 00:00:03  "ABC"  "DEF"       3      10
4  2020-03-01 00:00:04  "ABC"  "DEF"       0       0
5  2020-03-01 00:00:05  "ABC"  "DEF"       0       0
...
n  2020-03-31 23:59:59  "ABC"  "DEF"       0       0

The tricky part is that the columns var1 to varN are not allowed to be nulled and need to be preserved; only the columns count1 to countN need to be nulled (filled with zeros). I know that doing this with an offset of 1S will drastically blow up my DataFrame, but in the next step I need to run calculations on each countN column to get their basic statistics ("sum", "mean", "std", "median", "var", "min", "max", quantiles, etc.), and that's why I need all these null values - so my time series is expanded to the full length and my calculations won't be distorted.
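To give a rough sense of the scale this implies (an illustrative check only, not part of the original question), the fully expanded per-second index for March 2020 has 2,678,400 slots, so the zero rows dominate every statistic:

import pandas as pd

# illustrative only: size of the fully expanded per-second range
first_day = pd.Timestamp("2020-03-01 00:00:00")
last_day = pd.Timestamp("2020-03-31 23:59:59")
full_index = pd.date_range(first_day, last_day, freq="s")
print(len(full_index))  # 2678400 one-second slots per group
# e.g. for the group ("ABC", "DEF") above, count1 sums to 10,
# so its mean over the expanded series is 10 / 2678400, not 10 / 2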

Clarification: After enlarging each of the groups, I would like to start calculating each group's statistics. For this, there are two next steps I can think of: (1) concatenate all enlarged groups back into one huge DataFrame, group again with enlarged_df.groupby(["var1", "var2"]), and call an aggregation function on each of the countN columns - or, what may be more efficient but I can't think of a solution for right now, (2) maybe use something like .apply on the already grouped and enlarged data? Some function like this:

lst = []
# go through all countN columns and calculate their statistics per group
for count_col in [c for c in df.columns if "count" in c]:
    df_tmp = gb[count_col].agg(["sum", "mean", "std", "median", "var", "min", "max"])
    df_tmp.columns = [f"{count_col}_{c}" for c in df_tmp.columns]
    lst.append(df_tmp)

# join all the calculations of all countN columns to one DataFrame
final_df = lst.pop(0)
for df_tmp in lst:
   final_df = final_df.join(df_tmp)

final_df
  var1   var2  count1_sum count1_mean ... count2_sum count2_mean ...
1  "A"    "B"           1           1             12          12
2  "C"    "C"           2           2            179         179
3  "C"    "D"           1           1             72          72
4  "C"    "E"           4           4              7           7
5  "D"    "E"           2           2             47          47
6  "H"    "F"           1           1             31          31
7  "ABC"  "DEF"        10           5             84          42
...
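As an alternative to the per-column loop above, a single groupby call can aggregate all countN columns and all statistics at once and then flatten the resulting MultiIndex columns. This is only a sketch, assuming a hypothetical enlarged_df that holds all expanded groups concatenated together:

# sketch, assuming `enlarged_df` is the concatenation of all expanded groups
count_cols = [c for c in enlarged_df.columns if c.startswith("count")]
stats = ["sum", "mean", "std", "median", "var", "min", "max"]
final_df = enlarged_df.groupby(["var1", "var2"])[count_cols].agg(stats)
# flatten the (column, statistic) MultiIndex into names like "count1_sum"
final_df.columns = [f"{col}_{stat}" for col, stat in final_df.columns]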

I'm particularly interested in speed, given the sizes this DataFrame could reach. I've been sitting on this for a few days now. Thanks for helping!

My approach would be to first reindex the groups, then separately fill the NaNs in var1, var2, count1, and count2, and then directly compute the various statistics. Here's an example for just the mean and std statistics:

import numpy as np
import pandas as pd

last_day = df.datetime.max()
first_day = df.datetime.min()
idx = pd.date_range(first_day, last_day, freq='s')

def apply_function(g):
    # move the datetime column into the index and expand the group to the full range
    g.index = pd.DatetimeIndex(g.pop('datetime'))
    g = g.reindex(idx, fill_value=np.nan)

    # preserve the var columns, zero-fill the count columns
    g[['var1', 'var2']] = g[['var1', 'var2']].ffill().bfill()
    g[['count1', 'count2']] = g[['count1', 'count2']].fillna(0)

    return pd.Series(dict(
        mean_1 = g.count1.mean(),
        mean_2 = g.count2.mean(),
        std_1 = g.count1.std(),
        std_2 = g.count2.std()))

df.groupby(['var1', 'var2']).apply(apply_function)

The result is the following:

             mean_1     mean_2     std_1       std_2
var1 var2                                           
A    B     0.333333   4.000000  0.577350    6.928203
ABC  DEF   3.333333  28.000000  3.511885   40.149720
C    C     0.666667  59.666667  1.154701  103.345698
     D     0.333333  24.000000  0.577350   41.569219
     E     1.333333   2.333333  2.309401    4.041452
D    E     0.666667  15.666667  1.154701   27.135463
H    F     0.333333  10.333333  0.577350   17.897858
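If all of the requested statistics are needed for every countN column, apply_function can be extended in the same spirit. This is a sketch under the same assumptions as above (the idx index and the df from the question), not a drop-in replacement:

stats = ["sum", "mean", "std", "median", "var", "min", "max"]

def apply_all_stats(g):
    g.index = pd.DatetimeIndex(g.pop("datetime"))
    g = g.reindex(idx, fill_value=np.nan)
    g[["var1", "var2"]] = g[["var1", "var2"]].ffill().bfill()
    count_cols = [c for c in g.columns if c.startswith("count")]
    g[count_cols] = g[count_cols].fillna(0)
    # returns a Series with a (column, statistic) index, e.g. ("count1", "mean")
    return g[count_cols].agg(stats).unstack()

df.groupby(["var1", "var2"]).apply(apply_all_stats)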

Otherwise, you can first fix each group and then calculate the statistics:

gp = df.groupby(['var1', 'var2'])
my_g = gp.get_group(('ABC', 'DEF'))

my_g.index = pd.DatetimeIndex(my_g.pop('datetime'))
my_g = my_g.reindex(idx, fill_value=np.nan)
my_g[['var1', 'var2']] = my_g[['var1', 'var2']].ffill().bfill()
my_g[['count1', 'count2']] = my_g[['count1','count2']].fillna(0)
print(my_g)

Output:

                    var1 var2  count1  count2
2020-03-01 00:00:01  ABC  DEF     0.0     0.0
2020-03-01 00:00:02  ABC  DEF     7.0    74.0
2020-03-01 00:00:03  ABC  DEF     3.0    10.0
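
From here, the statistics for this single fixed group can be computed directly; a minimal follow-up using the aggregations listed in the question:

print(my_g[["count1", "count2"]].agg(["sum", "mean", "std", "median", "var", "min", "max"]))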
