Most efficient way to use resample on groupby with start and end datetime while preserving certain columns - and calculate statistics after that
I am working with a huge DataFrame (in terms of shape); my example below is just a simplified one.
Assume the following scenario:
# we have these two datetime objects as start and end of my data set
first_day = pd.Timestamp("2020-03-01 00:00:00")
last_day = pd.Timestamp("2020-03-31 23:59:59")
# assume we have a big DataFrame df like this with many, many rows:
datetime var1 var2 count1 count2
1 2020-03-01 00:00:01 "A" "B" 1 12
2 2020-03-01 00:00:01 "C" "C" 2 179
3 2020-03-01 00:00:01 "C" "D" 1 72
4 2020-03-01 00:00:02 "C" "E" 4 7
5 2020-03-01 00:00:02 "D" "E" 2 47
6 2020-03-01 00:00:02 "H" "F" 1 31
7 2020-03-01 00:00:02 "ABC" "DEF" 7 74
8 2020-03-01 00:00:03 "ABC" "DEF" 3 10
...
# I now want to group on this DataFrame like this:
gb = df.groupby(["var1", "var2"])
# which yields groups like this, for example:
datetime var1 var2 count1 count2
7 2020-03-01 00:00:02 "ABC" "DEF" 7 74
8 2020-03-01 00:00:03 "ABC" "DEF" 3 10
What I now need to do is resample each group, using the given first_day and last_day and the offset alias 1S, so that for each group I get something like this:
datetime var1 var2 count1 count2
0 2020-03-01 00:00:00 "ABC" "DEF" 0 0
1 2020-03-01 00:00:01 "ABC" "DEF" 0 0
2 2020-03-01 00:00:02 "ABC" "DEF" 7 74
3 2020-03-01 00:00:03 "ABC" "DEF" 3 10
4 2020-03-01 00:00:04 "ABC" "DEF" 0 0
5 2020-03-01 00:00:05 "ABC" "DEF" 0 0
...
n 2020-03-31 23:59:59 "ABC" "DEF" 0 0
The tricky part is that the var1 to varN columns must not be null and need to be preserved; only the count1 to countN columns should be empty. I know that doing this with an offset of 1S will completely blow up my DataFrame, but in the next step I need to run calculations on every countN column to get their basic statistics ("sum", "mean", "std", "median", "var", "min", "max", "quantiles", etc.), which is why I need all those null values: the time series is then expanded to its full length and my calculations are not skewed.
To clarify: after enlarging each group, I want to start computing the statistics per group. For that I can think of two next steps: (1) concatenate all the enlarged groups back into one huge DataFrame, then group again with enlarged_df.groupby(["var1", "var2"]) and call an aggregation function on every countN column, or (2), which might be more efficient, but I cannot come up with a solution for it right now: use something like .apply directly on the already grouped and enlarged data? Some function like this:
lst = []
gb = enlarged_df.groupby(["var1", "var2"])
# go through all countN columns and calculate their statistics
for count_col in [c for c in df.columns if "count" in c]:
    df_tmp = gb[count_col].agg(["sum", "mean", "std", "median", "var", "min", "max"])
    df_tmp.columns = [f"{count_col}_{c}" for c in df_tmp.columns]
    lst.append(df_tmp)
# join the statistics of all countN columns into one DataFrame
final_df = lst.pop(0)
for df_tmp in lst:
    final_df = final_df.join(df_tmp)
final_df
var1 var2 count1_sum count1_mean ... count2_sum count2_mean ...
1 "A" "B" 1 1 12 12
2 "C" "C" 2 2 179 179
3 "C" "D" 1 1 72 72
4 "C" "E" 4 4 7 7
5 "D" "E" 2 2 47 47
6 "H" "F" 1 1 31 31
7 "ABC" "DEF" 10 5 84 42
...
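For step (1), the per-column loop can also be collapsed into a single agg call over all countN columns at once. A sketch with toy data; flattening the resulting (column, statistic) MultiIndex into names like count1_sum is an assumption to match the table above:

```python
import pandas as pd

# toy stand-in for the enlarged, re-grouped DataFrame
df = pd.DataFrame({
    "var1": ["ABC", "ABC", "C"],
    "var2": ["DEF", "DEF", "C"],
    "count1": [7, 3, 2],
    "count2": [74, 10, 179],
})

stats = ["sum", "mean", "std", "median", "var", "min", "max"]
count_cols = [c for c in df.columns if "count" in c]

# one agg call over all countN columns at once
final_df = df.groupby(["var1", "var2"])[count_cols].agg(stats)
# flatten the (column, statistic) MultiIndex into count1_sum, count1_mean, ...
final_df.columns = [f"{col}_{stat}" for col, stat in final_df.columns]
print(final_df)
```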
I am particularly interested in speed, given the sizes this DataFrame can reach. I have been sitting on this for days. Thanks for your help!
My approach is to first reindex each group, then fill the NaNs in var1, var2, count1 and count2 separately, and then compute the various statistics directly. Here is an example for the mean and std statistics:
last_day = df.datetime.max()
first_day = df.datetime.min()
idx = pd.date_range(first_day, last_day, freq='s')
def apply_function(g):
    g.index = pd.DatetimeIndex(g.pop('datetime'))
    g = g.reindex(idx, fill_value=np.nan)
    g[['var1', 'var2']] = g[['var1', 'var2']].ffill().bfill()
    g[['count1', 'count2']] = g[['count1', 'count2']].fillna(0)
    return pd.Series(dict(
        mean_1=g.count1.mean(),
        mean_2=g.count2.mean(),
        std_1=g.count1.std(),
        std_2=g.count2.std()))
df.groupby(['var1', 'var2']).apply(apply_function)
The result looks like this:
mean_1 mean_2 std_1 std_2
var1 var2
A B 0.333333 4.000000 0.577350 6.928203
ABC DEF 3.333333 28.000000 3.511885 40.149720
C C 0.666667 59.666667 1.154701 103.345698
D 0.333333 24.000000 0.577350 41.569219
E 1.333333 2.333333 2.309401 4.041452
D E 0.666667 15.666667 1.154701 27.135463
H F 0.333333 10.333333 0.577350 17.897858
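Since reindexing only pads zeros onto each group, the same mean and std can also be derived from per-group sums without materializing the enlarged frames, which may matter at the sizes mentioned in the question. A sketch (sample std with ddof=1, matching the values in the table above; the toy data reuses three of the example rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-03-01 00:00:02", "2020-03-01 00:00:03", "2020-03-01 00:00:01",
    ]),
    "var1": ["ABC", "ABC", "A"],
    "var2": ["DEF", "DEF", "A" and "A"][:3],
    "count1": [7, 3, 1],
})
df["var2"] = ["DEF", "DEF", "B"]

# length of the full per-second index (every group is padded to this length)
n = len(pd.date_range(df.datetime.min(), df.datetime.max(), freq="s"))

agg = df.groupby(["var1", "var2"])["count1"].agg(
    s="sum", sq=lambda x: (x ** 2).sum()
)
# mean over the padded series is just sum / n; std follows from the sum of squares
mean_1 = agg.s / n
std_1 = np.sqrt((agg.sq - agg.s ** 2 / n) / (n - 1))
print(mean_1, std_1)
```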
Alternatively, you can first fix up a single group and then compute its statistics:
gp = df.groupby(['var1', 'var2'])
my_g = gp.get_group(('ABC', 'DEF'))
my_g.index = pd.DatetimeIndex(my_g.pop('datetime'))
my_g = my_g.reindex(idx, fill_value=np.nan)
my_g[['var1', 'var2']] = my_g[['var1', 'var2']].ffill().bfill()
my_g[['count1', 'count2']] = my_g[['count1','count2']].fillna(0)
print(my_g)
Output:
var1 var2 count1 count2
2020-03-01 00:00:01 ABC DEF 0.0 0.0
2020-03-01 00:00:02 ABC DEF 7.0 74.0
2020-03-01 00:00:03 ABC DEF 3.0 10.0
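From such a fixed group the statistics can then be computed directly with a single agg call; a sketch that rebuilds the small group above by hand:

```python
import pandas as pd

# toy version of the fixed ("ABC", "DEF") group from the output above
idx = pd.date_range("2020-03-01 00:00:01", "2020-03-01 00:00:03", freq="s")
my_g = pd.DataFrame(
    {"var1": "ABC", "var2": "DEF",
     "count1": [0.0, 7.0, 3.0], "count2": [0.0, 74.0, 10.0]},
    index=idx,
)

# one row per statistic, one column per countN column
stats = my_g[["count1", "count2"]].agg(
    ["sum", "mean", "std", "median", "var", "min", "max"])
print(stats)
```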