Most efficient way to use resample on groupby with start and end datetime while preserving certain columns - and calculate statistics after that
I work with huge DataFrames in terms of shape; my example is only a boiled-down one.
Let's assume the following scenario:
# we have these two datetime objects as start and end for my data set
first_day = pd.Timestamp("2020-03-01 00:00:00")
last_day = pd.Timestamp("2020-03-31 23:59:59")
# assume we have a big DataFrame df like this with many, many rows:
datetime var1 var2 count1 count2
1 2020-03-01 00:00:01 "A" "B" 1 12
2 2020-03-01 00:00:01 "C" "C" 2 179
3 2020-03-01 00:00:01 "C" "D" 1 72
4 2020-03-01 00:00:02 "C" "E" 4 7
5 2020-03-01 00:00:02 "D" "E" 2 47
6 2020-03-01 00:00:02 "H" "F" 1 31
7 2020-03-01 00:00:02 "ABC" "DEF" 7 74
8 2020-03-01 00:00:03 "ABC" "DEF" 3 10
...
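For reference, a small runnable version of this toy frame can be built like this (values copied from the table above):

```python
import pandas as pd

# toy DataFrame mirroring the sample table in the question
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-03-01 00:00:01", "2020-03-01 00:00:01", "2020-03-01 00:00:01",
        "2020-03-01 00:00:02", "2020-03-01 00:00:02", "2020-03-01 00:00:02",
        "2020-03-01 00:00:02", "2020-03-01 00:00:03",
    ]),
    "var1": ["A", "C", "C", "C", "D", "H", "ABC", "ABC"],
    "var2": ["B", "C", "D", "E", "E", "F", "DEF", "DEF"],
    "count1": [1, 2, 1, 4, 2, 1, 7, 3],
    "count2": [12, 179, 72, 7, 47, 31, 74, 10],
})
```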
# I now want to group on this DataFrame like this:
gb = df.groupby(["var1", "var2"])
# what yields me groups like this as an example:
datetime var1 var2 count1 count2
7 2020-03-01 00:00:02 "ABC" "DEF" 7 74
8 2020-03-01 00:00:03 "ABC" "DEF" 3 10
What I need to do now is resample every group with the given first_day and last_day and the offset alias 1S, so I get something like this for each one:
datetime var1 var2 count1 count2
0 2020-03-01 00:00:00 "ABC" "DEF" 0 0
1 2020-03-01 00:00:01 "ABC" "DEF" 0 0
2 2020-03-01 00:00:02 "ABC" "DEF" 7 74
3 2020-03-01 00:00:03 "ABC" "DEF" 3 10
4 2020-03-01 00:00:04 "ABC" "DEF" 0 0
5 2020-03-01 00:00:05 "ABC" "DEF" 0 0
...
n 2020-03-31 23:59:59 "ABC" "DEF" 0 0
The tricky part is: the columns var1 to varN are not allowed to get nulled and need to be preserved; only the columns count1 to countN need to get nulled. I know, doing so with an offset of 1S will drastically blow up my DataFrame, but in the next step I need to do calculations on each countN column to get their basic statistics "sum", "mean", "std", "median", "var", "min", "max", quantiles, etc., and that's why I need all these null values: so my time series is expanded to the full length and my calculations won't be distorted.
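To illustrate why the padding matters for the statistics, compare the mean of a group with and without the missing seconds filled in (toy numbers taken from the ABC/DEF group above, assuming a 3-second window):

```python
import pandas as pd

# the ABC/DEF group has count1 values 7 and 3 inside a 3-second window
observed = pd.Series([7, 3])
padded = pd.Series([0, 7, 3])      # the missing second filled with 0

mean_observed = observed.mean()    # 5.0
mean_padded = padded.mean()        # 10/3, so the zero changes the result
```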
Clarification: After enlarging each of the groups, I would like to start the calculation of each group's statistics. For this, there are two next steps I can think of: (1) either concatenate all enlarged groups back into one huge DataFrame, then group again with enlarged_df.groupby(["var1", "var2"]) and call an aggregation function on each of the countN columns, or (2) maybe use something like .apply on the already grouped and enlarged data, which may be more efficient, but I can't think of a solution for how to do this right now. Some function like this:
lst = []
# go through all countN columns and calculate their statistics per group
gb = df.groupby(["var1", "var2"])
for count_col in [c for c in df.columns if "count" in c]:
    df_tmp = gb[count_col].agg(["sum", "mean", "std", "median", "var", "min", "max"])
    df_tmp.columns = [f"{count_col}_" + str(c) for c in df_tmp.columns]
    lst.append(df_tmp)
# join all the calculations of all countN columns into one DataFrame
final_df = lst.pop(0)
for df_tmp in lst:
    final_df = final_df.join(df_tmp)
final_df
var1 var2 count1_sum count1_mean ... count2_sum count2_mean ...
1 "A" "B" 1 1 12 12
2 "C" "C" 2 2 179 179
3 "C" "D" 1 1 72 72
4 "C" "E" 4 4 7 7
5 "D" "E" 2 2 47 47
6 "H" "F" 1 1 31 31
7 "ABC" "DEF" 10 5 84 42
...
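As an aside, the per-column loop and joins can usually be replaced by a single groupby(...).agg(...) call over all countN columns at once; pandas then returns (column, statistic) MultiIndex columns that can be flattened into count1_sum-style names. A minimal sketch with a tiny stand-in frame:

```python
import pandas as pd

# tiny stand-in for the real df
df = pd.DataFrame({
    "var1": ["A", "ABC", "ABC"],
    "var2": ["B", "DEF", "DEF"],
    "count1": [1, 7, 3],
    "count2": [12, 74, 10],
})

stats = ["sum", "mean", "std", "median", "var", "min", "max"]
count_cols = [c for c in df.columns if "count" in c]

# one agg call over all count columns at once
final_df = df.groupby(["var1", "var2"])[count_cols].agg(stats)
# flatten the (column, stat) MultiIndex into "count1_sum"-style names
final_df.columns = [f"{col}_{stat}" for col, stat in final_df.columns]
```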
I'm particularly interested in speed, regarding the sizes this DataFrame could achieve. I've been sitting on this for a few days now. Thanks for helping!
My approach would be to first reindex the groups, then separately fill the NaNs in var1, var2, count1, and count2, and then directly compute the various statistics. Here's an example for just the mean and std statistics:
last_day = df.datetime.max()
first_day = df.datetime.min()
idx = pd.date_range(first_day, last_day, freq='s')

def apply_function(g):
    g.index = pd.DatetimeIndex(g.pop('datetime'))
    g = g.reindex(idx, fill_value=np.nan)
    g[['var1', 'var2']] = g[['var1', 'var2']].fillna(method='ffill').fillna(method='bfill')
    g[['count1', 'count2']] = g[['count1', 'count2']].fillna(0)
    return pd.Series(dict(
        mean_1=g.count1.mean(),
        mean_2=g.count2.mean(),
        std_1=g.count1.std(),
        std_2=g.count2.std()))

df.groupby(['var1', 'var2']).apply(apply_function)
The result is the following:
mean_1 mean_2 std_1 std_2
var1 var2
A B 0.333333 4.000000 0.577350 6.928203
ABC DEF 3.333333 28.000000 3.511885 40.149720
C C 0.666667 59.666667 1.154701 103.345698
D 0.333333 24.000000 0.577350 41.569219
E 1.333333 2.333333 2.309401 4.041452
D E 0.666667 15.666667 1.154701 27.135463
H F 0.333333 10.333333 0.577350 17.897858
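The same pattern extends to the full statistics set from the question; the pd.Series returned per group just needs more entries. A sketch of such an extended per-group function (the q25/q75 quantile levels and the helper's name are assumptions, not from the original):

```python
import numpy as np
import pandas as pd

def apply_function_full(g, idx, count_cols=("count1", "count2")):
    """Per group: expand to the global 1-second grid given by idx,
    zero-fill the count columns, and compute the full statistics set."""
    g.index = pd.DatetimeIndex(g.pop("datetime"))
    g = g.reindex(idx, fill_value=np.nan)  # var columns are not needed for the stats
    g[list(count_cols)] = g[list(count_cols)].fillna(0)
    out = {}
    for c in count_cols:
        s = g[c]
        out.update({
            f"{c}_sum": s.sum(), f"{c}_mean": s.mean(), f"{c}_std": s.std(),
            f"{c}_median": s.median(), f"{c}_var": s.var(),
            f"{c}_min": s.min(), f"{c}_max": s.max(),
            f"{c}_q25": s.quantile(0.25), f"{c}_q75": s.quantile(0.75),
        })
    return pd.Series(out)
```

It plugs into the same call as before, since GroupBy.apply forwards extra keyword arguments to the function: df.groupby(["var1", "var2"]).apply(apply_function_full, idx=idx).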
Otherwise, you can first fix each group and then calculate the statistics:
gp = df.groupby(['var1', 'var2'])
my_g = gp.get_group(('ABC', 'DEF'))
my_g.index = pd.DatetimeIndex(my_g.pop('datetime'))
my_g = my_g.reindex(idx, fill_value=np.nan)
my_g[['var1', 'var2']] = my_g[['var1','var2']].fillna(method='ffill').fillna(method='bfill')
my_g[['count1', 'count2']] = my_g[['count1','count2']].fillna(0)
print(my_g)
Output:
var1 var2 count1 count2
2020-03-01 00:00:01 ABC DEF 0.0 0.0
2020-03-01 00:00:02 ABC DEF 7.0 74.0
2020-03-01 00:00:03 ABC DEF 3.0 10.0