Most efficient way to use resample on groupby with start and end datetime while preserving certain columns - and calculate statistics after that
I work with huge DataFrames in terms of shape; my example is only a boiled-down one.
Let's assume the following scenario:
# we have these two datetime objects as start and end for my data set
first_day = pd.Timestamp("2020-03-01 00:00:00")
last_day = pd.Timestamp("2020-03-31 23:59:59")
# assume we have a big DataFrame df like this with many, many rows:
datetime var1 var2 count1 count2
1 2020-03-01 00:00:01 "A" "B" 1 12
2 2020-03-01 00:00:01 "C" "C" 2 179
3 2020-03-01 00:00:01 "C" "D" 1 72
4 2020-03-01 00:00:02 "C" "E" 4 7
5 2020-03-01 00:00:02 "D" "E" 2 47
6 2020-03-01 00:00:02 "H" "F" 1 31
7 2020-03-01 00:00:02 "ABC" "DEF" 7 74
8 2020-03-01 00:00:03 "ABC" "DEF" 3 10
...
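For reference, a small runnable version of this toy frame can be built like this (values copied from the table above):

```python
import pandas as pd

# toy DataFrame mirroring the sample table in the question
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-03-01 00:00:01", "2020-03-01 00:00:01", "2020-03-01 00:00:01",
        "2020-03-01 00:00:02", "2020-03-01 00:00:02", "2020-03-01 00:00:02",
        "2020-03-01 00:00:02", "2020-03-01 00:00:03",
    ]),
    "var1": ["A", "C", "C", "C", "D", "H", "ABC", "ABC"],
    "var2": ["B", "C", "D", "E", "E", "F", "DEF", "DEF"],
    "count1": [1, 2, 1, 4, 2, 1, 7, 3],
    "count2": [12, 179, 72, 7, 47, 31, 74, 10],
})
```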
# I now want to group on this DataFrame like this:
gb = df.groupby(["var1", "var2"])
# what yields me groups like this as an example:
datetime var1 var2 count1 count2
7 2020-03-01 00:00:02 "ABC" "DEF" 7 74
8 2020-03-01 00:00:03 "ABC" "DEF" 3 10
What I need to do now is resample every group with the given first_day and last_day and the offset alias 1S, so I get something like this for each one:
datetime var1 var2 count1 count2
0 2020-03-01 00:00:00 "ABC" "DEF" 0 0
1 2020-03-01 00:00:01 "ABC" "DEF" 0 0
2 2020-03-01 00:00:02 "ABC" "DEF" 7 74
3 2020-03-01 00:00:03 "ABC" "DEF" 3 10
4 2020-03-01 00:00:04 "ABC" "DEF" 0 0
5 2020-03-01 00:00:05 "ABC" "DEF" 0 0
...
n 2020-03-31 23:59:59 "ABC" "DEF" 0 0
The tricky part is: the columns var1 to varN are not allowed to get nulled and need to be preserved; only the columns count1 to countN need to get nulled. I know, doing so with an offset of 1S will drastically blow up my DataFrame, but in the next step I need to do calculations on each countN column to get their basic statistics "sum", "mean", "std", "median", "var", "min", "max", quantiles, etc., and that's why I need all these null values: so my time series is expanded to the full length and my calculations won't be distorted.
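To illustrate why the padding matters for the statistics, compare the mean of a group with and without the missing seconds filled in (toy numbers taken from the ABC/DEF group above, assuming a 3-second window):

```python
import pandas as pd

# the ABC/DEF group has count1 values 7 and 3 inside a 3-second window
observed = pd.Series([7, 3])
padded = pd.Series([0, 7, 3])      # the missing second filled with 0

mean_observed = observed.mean()    # 5.0
mean_padded = padded.mean()        # 10/3, so the zero changes the result
```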
Clarification: After enlarging each of the groups, I would like to start the calculation of each group's statistics. For this, there are two next steps I can think of: (1) either concatenate all enlarged groups back into one huge DataFrame, then group again with enlarged_df.groupby(["var1", "var2"]) and call an aggregation function on each of the countN columns, or (2) maybe use something like .apply on the already grouped and enlarged data, which may be more efficient, but I can't think of a solution for how to do this right now. Some function like this:
lst = []
# go through all countN columns and calculate their statistics per group
gb = df.groupby(["var1", "var2"])
for count_col in [c for c in df.columns if "count" in c]:
    df_tmp = gb[count_col].agg(["sum", "mean", "std", "median", "var", "min", "max"])
    df_tmp.columns = [f"{count_col}_" + str(c) for c in df_tmp.columns]
    lst.append(df_tmp)
# join all the calculations of all countN columns into one DataFrame
final_df = lst.pop(0)
for df_tmp in lst:
    final_df = final_df.join(df_tmp)
final_df
var1 var2 count1_sum count1_mean ... count2_sum count2_mean ...
1 "A" "B" 1 1 12 12
2 "C" "C" 2 2 179 179
3 "C" "D" 1 1 72 72
4 "C" "E" 4 4 7 7
5 "D" "E" 2 2 47 47
6 "H" "F" 1 1 31 31
7 "ABC" "DEF" 10 5 84 42
...
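As an aside, the per-column loop and joins can usually be replaced by a single groupby(...).agg(...) call over all countN columns at once; pandas then returns (column, statistic) MultiIndex columns that can be flattened into count1_sum-style names. A minimal sketch with a tiny stand-in frame:

```python
import pandas as pd

# tiny stand-in for the real df
df = pd.DataFrame({
    "var1": ["A", "ABC", "ABC"],
    "var2": ["B", "DEF", "DEF"],
    "count1": [1, 7, 3],
    "count2": [12, 74, 10],
})

stats = ["sum", "mean", "std", "median", "var", "min", "max"]
count_cols = [c for c in df.columns if "count" in c]

# one agg call over all count columns at once
final_df = df.groupby(["var1", "var2"])[count_cols].agg(stats)
# flatten the (column, stat) MultiIndex into "count1_sum"-style names
final_df.columns = [f"{col}_{stat}" for col, stat in final_df.columns]
```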
I'm particularly interested in speed, regarding the sizes this DataFrame could achieve. I've been sitting on this for a few days now. Thanks for helping!
My approach would be to first reindex the groups, then separately fill the NaNs in var1, var2, count1, and count2, and then directly compute the various statistics. Here's an example for just the mean and std statistics:
last_day = df.datetime.max()
first_day = df.datetime.min()
idx = pd.date_range(first_day, last_day, freq='s')

def apply_function(g):
    g.index = pd.DatetimeIndex(g.pop('datetime'))
    g = g.reindex(idx, fill_value=np.nan)
    g[['var1', 'var2']] = g[['var1', 'var2']].fillna(method='ffill').fillna(method='bfill')
    g[['count1', 'count2']] = g[['count1', 'count2']].fillna(0)
    return pd.Series(dict(
        mean_1=g.count1.mean(),
        mean_2=g.count2.mean(),
        std_1=g.count1.std(),
        std_2=g.count2.std()))

df.groupby(['var1', 'var2']).apply(apply_function)
The result is the following:
mean_1 mean_2 std_1 std_2
var1 var2
A B 0.333333 4.000000 0.577350 6.928203
ABC DEF 3.333333 28.000000 3.511885 40.149720
C C 0.666667 59.666667 1.154701 103.345698
D 0.333333 24.000000 0.577350 41.569219
E 1.333333 2.333333 2.309401 4.041452
D E 0.666667 15.666667 1.154701 27.135463
H F 0.333333 10.333333 0.577350 17.897858
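The same pattern extends to the full statistics set from the question; the pd.Series returned per group just needs more entries. A sketch of such an extended per-group function (the q25/q75 quantile levels and the helper's name are assumptions, not from the original):

```python
import numpy as np
import pandas as pd

def apply_function_full(g, idx, count_cols=("count1", "count2")):
    """Per group: expand to the global 1-second grid given by idx,
    zero-fill the count columns, and compute the full statistics set."""
    g.index = pd.DatetimeIndex(g.pop("datetime"))
    g = g.reindex(idx, fill_value=np.nan)  # var columns are not needed for the stats
    g[list(count_cols)] = g[list(count_cols)].fillna(0)
    out = {}
    for c in count_cols:
        s = g[c]
        out.update({
            f"{c}_sum": s.sum(), f"{c}_mean": s.mean(), f"{c}_std": s.std(),
            f"{c}_median": s.median(), f"{c}_var": s.var(),
            f"{c}_min": s.min(), f"{c}_max": s.max(),
            f"{c}_q25": s.quantile(0.25), f"{c}_q75": s.quantile(0.75),
        })
    return pd.Series(out)
```

It plugs into the same call as before, since GroupBy.apply forwards extra keyword arguments to the function: df.groupby(["var1", "var2"]).apply(apply_function_full, idx=idx).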
Otherwise, you can first fix each group and then calculate the statistics:
gp = df.groupby(['var1', 'var2'])
my_g = gp.get_group(('ABC', 'DEF'))
my_g.index = pd.DatetimeIndex(my_g.pop('datetime'))
my_g = my_g.reindex(idx, fill_value=np.nan)
my_g[['var1', 'var2']] = my_g[['var1','var2']].fillna(method='ffill').fillna(method='bfill')
my_g[['count1', 'count2']] = my_g[['count1','count2']].fillna(0)
print(my_g)
Output:
var1 var2 count1 count2
2020-03-01 00:00:01 ABC DEF 0.0 0.0
2020-03-01 00:00:02 ABC DEF 7.0 74.0
2020-03-01 00:00:03 ABC DEF 3.0 10.0