Incrementally Adding To Pandas Groupby Transform Function
I have a large DataFrame with many columns that are GroupBy functions of the original data. Computing all of these functions takes a long time. Each day I get some new data, and currently I recompute all of these functions from scratch. Is there a way to update these GroupBy columns without recomputing everything? Here is a small DataFrame as an example:
df = pd.DataFrame({'x': [0, 1, 2, 5, 4, 5, 8, 7], 'g1': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'a'], 'g2': ['a', 'b', 'a', 'a', 'b', 'b', 'a', 'a']})
   x g1 g2
0  0  a  a
1  1  b  b
2  2  c  a
3  5  a  a
4  4  b  b
5  5  c  b
6  8  a  a
7  7  a  a
Now an example column:
def lag(array):
    # Shift the group's values down by one position; the first entry becomes NaN
    out = np.nan * array
    out[1:] = array[:-1]
    return out

df['y'] = df.groupby(['g1', 'g2'])['x'].transform(lag)
   x g1 g2    y
0  0  a  a  NaN
1  1  b  b  NaN
2  2  c  a  NaN
3  5  a  a  0.0
4  4  b  b  1.0
5  5  c  b  NaN
6  8  a  a  5.0
7  7  a  a  8.0
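For what it's worth, the lag helper above appears to do the same job as pandas' built-in GroupBy.shift. The following is a minimal sketch, assuming a one-row lag within each (g1, g2) group is all that is needed. The expected values in the last comment are copied from the table above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [0, 1, 2, 5, 4, 5, 8, 7],
    'g1': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'a'],
    'g2': ['a', 'b', 'a', 'a', 'b', 'b', 'a', 'a'],
})

# shift() moves each group's values down by one row,
# leaving NaN in the first row of every group.
df['y'] = df.groupby(['g1', 'g2'])['x'].shift()

print(df['y'].tolist())
# [nan, nan, nan, 0.0, 1.0, nan, 5.0, 8.0]
```

Because shift is a built-in GroupBy method, it avoids calling a Python function once per group, which may matter on a large frame.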
Now let's say I get some new data to append to my original DataFrame:
newdf = pd.DataFrame({'x': [2, 1], 'g1': ['a', 'b'], 'g2': ['a', 'b']})
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original index labels
df = pd.concat([df, newdf])
   x g1 g2    y
0  0  a  a  NaN
1  1  b  b  NaN
2  2  c  a  NaN
3  5  a  a  0.0
4  4  b  b  1.0
5  5  c  b  NaN
6  8  a  a  5.0
7  7  a  a  8.0
0  2  a  a  NaN
1  1  b  b  NaN
Is there now a way to work out 'y' for the last 2 rows, without recalculating the whole column, to produce the following DataFrame?
   x g1 g2    y
0  0  a  a  NaN
1  1  b  b  NaN
2  2  c  a  NaN
3  5  a  a  0.0
4  4  b  b  1.0
5  5  c  b  NaN
6  8  a  a  5.0
7  7  a  a  8.0
0  2  a  a  7.0
1  1  b  b  4.0
One way to do this:
First create a column that marks the rows lag has already been applied to, then use mask to fill in 'y' only for the rows that are not marked.
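df['applied'] = 1  # mark rows whose 'y' is already computed
df = pd.concat([df, newdf])  # DataFrame.append was removed in pandas 2.0
# Replace 'y' only where 'applied' is not 1 (the newly added rows); assign back instead of inplace=True
df['y'] = df['y'].mask(df['applied'] != 1, df.groupby(['g1', 'g2'])['x'].transform(lag))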
which gives
   x g1 g2    y  applied
0  0  a  a  NaN      1.0
1  1  b  b  NaN      1.0
2  2  c  a  NaN      1.0
3  5  a  a  0.0      1.0
4  4  b  b  1.0      1.0
5  5  c  b  NaN      1.0
6  8  a  a  5.0      1.0
7  7  a  a  8.0      1.0
0  2  a  a  7.0      NaN
1  1  b  b  4.0      NaN
then drop the applied column:
df = df.drop(['applied'], axis=1)
which gives you what you wanted:
   x g1 g2    y
0  0  a  a  NaN
1  1  b  b  NaN
2  2  c  a  NaN
3  5  a  a  0.0
4  4  b  b  1.0
5  5  c  b  NaN
6  8  a  a  5.0
7  7  a  a  8.0
0  2  a  a  7.0
1  1  b  b  4.0
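Note that the mask approach still evaluates transform(lag) over every row and only discards the results for the old rows, so it may not save much time. A possible alternative for a lag specifically is to carry forward only the last 'x' of each group and compute 'y' for the new rows alone. The sketch below is an assumption-laden illustration, not a general recipe: the names keys, last_x, prev_x, first_in_batch and carried are made up for the example, 'x' is assumed to contain no missing values, and it relies on GroupBy.shift in place of the lag helper.

```python
import numpy as np
import pandas as pd

keys = ['g1', 'g2']

df = pd.DataFrame({
    'x': [0, 1, 2, 5, 4, 5, 8, 7],
    'g1': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'a'],
    'g2': ['a', 'b', 'a', 'a', 'b', 'b', 'a', 'a'],
})
df['y'] = df.groupby(keys)['x'].shift()  # initial full computation

newdf = pd.DataFrame({'x': [2, 1], 'g1': ['a', 'b'], 'g2': ['a', 'b']})

# State carried between batches: the last 'x' seen in each group.
# (Hypothetically this small table would be stored and updated each day
# rather than rebuilt from the full frame as it is here.)
last_x = df.groupby(keys)['x'].last().rename('prev_x').reset_index()

# Lag within the new batch only; the first new row of each group is NaN.
new_y = newdf.groupby(keys)['x'].shift()

# For the first new row of each group, look up the carried-over value.
# Groups never seen before get NaN from the left merge.
first_in_batch = ~newdf.duplicated(keys)
carried = newdf[keys].merge(last_x, on=keys, how='left')['prev_x'].to_numpy()
newdf['y'] = new_y.mask(first_in_batch, carried)

result = pd.concat([df, newdf])
print(result['y'].tolist())
# [nan, nan, nan, 0.0, 1.0, nan, 5.0, 8.0, 7.0, 4.0]
```

This only works because a one-step lag needs a single remembered value per group. Other GroupBy functions would need their own per-group state (for example, a running sum and count for a cumulative mean), and some cannot be updated incrementally at all.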