
Most efficient way to use resample on a groupby with start and end datetimes while preserving certain columns - and calculate statistics after that

I work with DataFrames that are huge in terms of shape; the example below is just a boiled-down one.

Let's assume the following scenario:

# we have these two datetime objects as start and end for my data set
first_day = pd.Timestamp("2020-03-01 00:00:00")
last_day = pd.Timestamp("2020-03-31 23:59:59")

# assume we have a big DataFrame df like this with many, many rows:
              datetime   var1   var2  count1  count2
1  2020-03-01 00:00:01    "A"    "B"       1      12
2  2020-03-01 00:00:01    "C"    "C"       2     179
3  2020-03-01 00:00:01    "C"    "D"       1      72
4  2020-03-01 00:00:02    "C"    "E"       4       7
5  2020-03-01 00:00:02    "D"    "E"       2      47
6  2020-03-01 00:00:02    "H"    "F"       1      31
7  2020-03-01 00:00:02  "ABC"  "DEF"       7      74
8  2020-03-01 00:00:03  "ABC"  "DEF"       3      10
...

# I now want to group this DataFrame like this:
gb = df.groupby(["var1", "var2"])

# which yields groups like this, for example:
              datetime   var1   var2  count1  count2
7  2020-03-01 00:00:02  "ABC"  "DEF"       7      74
8  2020-03-01 00:00:03  "ABC"  "DEF"       3      10

What I need to do now is resample every group between the given first_day and last_day with the offset alias 1S (one second), so that I get something like this for each group:

              datetime   var1   var2  count1  count2
0  2020-03-01 00:00:00  "ABC"  "DEF"       0       0
1  2020-03-01 00:00:01  "ABC"  "DEF"       0       0
2  2020-03-01 00:00:02  "ABC"  "DEF"       7      74
3  2020-03-01 00:00:03  "ABC"  "DEF"       3      10
4  2020-03-01 00:00:04  "ABC"  "DEF"       0       0
5  2020-03-01 00:00:05  "ABC"  "DEF"       0       0
...
n  2020-03-31 23:59:59  "ABC"  "DEF"       0       0

The tricky part is that the var1 to varN columns must not be nulled and need to be preserved; only the count1 to countN columns should be filled (here with zeros). I know that doing this with an offset of 1S will drastically blow up my DataFrame, but in the next step I need to calculate the basic statistics of each countN column ("sum", "mean", "std", "median", "var", "min", "max", quantiles, etc.), and that's why I need all these filler values: the time series is expanded to its full length, so my calculations won't be distorted.
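
For illustration, a minimal sketch of what that expansion could look like for a single group, assuming pandas; expand_group is a hypothetical helper, and it uses reindex against the full date range rather than resample, since resample alone would only cover the group's own time span:

import pandas as pd

# hypothetical helper: expand one group onto the full 1-second grid,
# preserving var1/var2 and zero-filling count1/count2
def expand_group(g, first_day, last_day):
    full_idx = pd.date_range(first_day, last_day, freq="s")
    g = g.set_index("datetime").reindex(full_idx)
    g[["var1", "var2"]] = g[["var1", "var2"]].ffill().bfill()
    g[["count1", "count2"]] = g[["count1", "count2"]].fillna(0)
    return g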

Clarification: After enlarging each of the groups, I would like to start calculating each group's statistics. For this, there are two next steps I can think of: (1) concatenate all enlarged groups back into one huge DataFrame, group again with enlarged_df.groupby(["var1", "var2"]), and call an aggregation function on each of the countN columns, or (2), which may be more efficient but I can't think of a solution for right now, use something like .apply on the already grouped and enlarged data. Some function like this:

lst = []
# go through all countN columns and calculate their per-group statistics
for count_col in [c for c in enlarged_df.columns if "count" in c]:
    df_tmp = enlarged_df.groupby(["var1", "var2"])[count_col].agg(
        ["sum", "mean", "std", "median", "var", "min", "max"])
    df_tmp.columns = [f"{count_col}_{c}" for c in df_tmp.columns]
    lst.append(df_tmp)

# join the calculations of all countN columns into one DataFrame
final_df = lst.pop(0)
for df_tmp in lst:
    final_df = final_df.join(df_tmp)

final_df
  var1   var2  count1_sum count1_mean ... count2_sum count2_mean ...
1  "A"    "B"           1           1             12          12
2  "C"    "C"           2           2            179         179
3  "C"    "D"           1           1             72          72
4  "C"    "E"           4           4              7           7
5  "D"    "E"           2           2             47          47
6  "H"    "F"           1           1             31          31
7  "ABC"  "DEF"        10           5             84          42
...

I'm particularly interested in speed, given the sizes this DataFrame can reach. I've been sitting on this for a few days now. Thanks for helping!

My approach would be to first reindex the groups, then separately fill the NaNs in var1, var2, count1, and count2, and then directly compute the various statistics. Here's an example for just the mean and std statistics:

import numpy as np
import pandas as pd

# build the full 1-second index; you could also use the fixed
# first_day/last_day from the question instead of the data's min/max
last_day = df.datetime.max()
first_day = df.datetime.min()
idx = pd.date_range(first_day, last_day, freq='s')

def apply_function(g):
    # move 'datetime' into the index so reindex can align on timestamps
    g.index = pd.DatetimeIndex(g.pop('datetime'))
    g = g.reindex(idx, fill_value=np.nan)  # missing seconds become all-NaN rows

    # preserve the group labels, zero-fill the counts
    g[['var1', 'var2']] = g[['var1', 'var2']].ffill().bfill()
    g[['count1', 'count2']] = g[['count1', 'count2']].fillna(0)

    return pd.Series(dict(
        mean_1=g.count1.mean(),
        mean_2=g.count2.mean(),
        std_1=g.count1.std(),
        std_2=g.count2.std()))

df.groupby(['var1', 'var2']).apply(apply_function)

The result is the following:

             mean_1     mean_2     std_1       std_2
var1 var2                                           
A    B     0.333333   4.000000  0.577350    6.928203
ABC  DEF   3.333333  28.000000  3.511885   40.149720
C    C     0.666667  59.666667  1.154701  103.345698
     D     0.333333  24.000000  0.577350   41.569219
     E     1.333333   2.333333  2.309401    4.041452
D    E     0.666667  15.666667  1.154701   27.135463
H    F     0.333333  10.333333  0.577350   17.897858
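
The same pattern generalizes to every countN column and to the full list of statistics from the question. A sketch, assuming all count columns share the "count" prefix (apply_function_all is a hypothetical name):

STATS = ["sum", "mean", "std", "median", "var", "min", "max"]

def apply_function_all(g):
    g.index = pd.DatetimeIndex(g.pop('datetime'))
    g = g.reindex(idx)
    # var1/var2 are carried by the group key, so only the counts matter here
    count_cols = [c for c in g.columns if c.startswith("count")]
    g[count_cols] = g[count_cols].fillna(0)
    out = g[count_cols].agg(STATS).unstack()  # Series with (column, stat) index
    out.index = [f"{c}_{s}" for c, s in out.index]
    return out

final_df = df.groupby(['var1', 'var2']).apply(apply_function_all)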

Alternatively, you can first fix up a single group and then calculate the statistics on it:

gp = df.groupby(['var1', 'var2'])
my_g = gp.get_group(('ABC', 'DEF'))

# same expansion as above, applied to one group
my_g.index = pd.DatetimeIndex(my_g.pop('datetime'))
my_g = my_g.reindex(idx, fill_value=np.nan)
my_g[['var1', 'var2']] = my_g[['var1', 'var2']].ffill().bfill()
my_g[['count1', 'count2']] = my_g[['count1', 'count2']].fillna(0)
print(my_g)

Output:

                    var1 var2  count1  count2
2020-03-01 00:00:01  ABC  DEF     0.0     0.0
2020-03-01 00:00:02  ABC  DEF     7.0    74.0
2020-03-01 00:00:03  ABC  DEF     3.0    10.0
