Incrementally Adding To Pandas Groupby Transform Function

I have a large DataFrame with many columns that are GroupBy functions of the original data. Computing all of these functions takes a long time. Each day I get some new data, and currently I recompute every function from scratch. Is there a way to update these GroupBy columns without recomputing them over the whole DataFrame? Here is a small DataFrame as an example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2, 5, 4, 5, 8, 7], 'g1': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'a'], 'g2': ['a', 'b', 'a', 'a', 'b', 'b', 'a', 'a']})

   x g1 g2
0  0  a  a
1  1  b  b
2  2  c  a
3  5  a  a
4  4  b  b
5  5  c  b
6  8  a  a
7  7  a  a

Now an example column:

def lag(array):
    out = np.nan * array
    out[1:] = array[:-1]
    return out

df['y'] = df.groupby(['g1', 'g2'])['x'].transform(lag)

   x g1 g2    y
0  0  a  a  NaN
1  1  b  b  NaN
2  2  c  a  NaN
3  5  a  a  0.0
4  4  b  b  1.0
5  5  c  b  NaN
6  8  a  a  5.0
7  7  a  a  8.0
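For this particular example, the same column can also be produced with the built-in `GroupBy.shift`, which avoids passing a Python function to `transform`. This is only a sketch for the one-step lag shown above. It is not a general replacement for arbitrary transform functions.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [0, 1, 2, 5, 4, 5, 8, 7],
    'g1': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'a'],
    'g2': ['a', 'b', 'a', 'a', 'b', 'b', 'a', 'a'],
})

# shift(1) moves each value down one row within its (g1, g2) group,
# leaving NaN in the first row of every group.
df['y'] = df.groupby(['g1', 'g2'])['x'].shift(1)
print(df)
```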

Now let's say I get some new data to append to my original DataFrame:

newdf = pd.DataFrame({'x': [2, 1], 'g1': ['a', 'b'], 'g2': ['a', 'b']})
df = pd.concat([df, newdf])  # DataFrame.append was removed in pandas 2.0

   x g1 g2    y
0  0  a  a  NaN
1  1  b  b  NaN
2  2  c  a  NaN
3  5  a  a  0.0
4  4  b  b  1.0
5  5  c  b  NaN
6  8  a  a  5.0
7  7  a  a  8.0
0  2  a  a  NaN
1  1  b  b  NaN

Is there now a way to work out 'y' for the last 2 rows without recalculating the whole column, to produce the following DataFrame?

   x g1 g2    y
0  0  a  a  NaN
1  1  b  b  NaN
2  2  c  a  NaN
3  5  a  a  0.0
4  4  b  b  1.0
5  5  c  b  NaN
6  8  a  a  5.0
7  7  a  a  8.0
0  2  a  a  7.0
1  1  b  b  4.0

One way to do this:

First create a column that marks which rows lag has already been applied to, then use mask to fill in 'y' only for the rows that are not marked.

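df['applied'] = 1  # mark the rows that already have 'y'
df = pd.concat([df, newdf])  # new rows get NaN in 'applied' and 'y'
# replace 'y' only where 'applied' is not 1, i.e. on the new rows
df['y'] = df['y'].mask(df['applied'] != 1, df.groupby(['g1', 'g2'])['x'].transform(lag))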

which gives

   x g1 g2    y  applied
0  0  a  a  NaN      1.0
1  1  b  b  NaN      1.0
2  2  c  a  NaN      1.0
3  5  a  a  0.0      1.0
4  4  b  b  1.0      1.0
5  5  c  b  NaN      1.0
6  8  a  a  5.0      1.0
7  7  a  a  8.0      1.0
0  2  a  a  7.0      NaN
1  1  b  b  4.0      NaN

Then drop the applied column:

df = df.drop(['applied'], axis=1)

which gives you what you wanted:

   x g1 g2    y
0  0  a  a  NaN
1  1  b  b  NaN
2  2  c  a  NaN
3  5  a  a  0.0
4  4  b  b  1.0
5  5  c  b  NaN
6  8  a  a  5.0
7  7  a  a  8.0
0  2  a  a  7.0
1  1  b  b  4.0
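Note that the `transform` call in the snippet above is still evaluated over every row before `mask` picks out the new ones. A possible way to reduce the work is to recompute only the groups that actually received new rows. The following is a minimal sketch under some assumptions: `GroupBy.shift` stands in for `lag`, the names `keys`, `touched`, `combined` and `n_old` are made up for illustration, and the combined frame gets a fresh unique index (unlike the duplicated 0, 1 labels shown above) so the new rows can be addressed unambiguously.

```python
import pandas as pd

keys = ['g1', 'g2']  # hypothetical name for the grouping columns

df = pd.DataFrame({
    'x': [0, 1, 2, 5, 4, 5, 8, 7],
    'g1': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'a'],
    'g2': ['a', 'b', 'a', 'a', 'b', 'b', 'a', 'a'],
})
df['y'] = df.groupby(keys)['x'].shift(1)  # existing, already computed column

newdf = pd.DataFrame({'x': [2, 1], 'g1': ['a', 'b'], 'g2': ['a', 'b']})

n_old = len(df)
# ignore_index=True gives labels 0..9, so the new rows are simply 8 and 9
combined = pd.concat([df, newdf], ignore_index=True)

# Boolean array: True for rows whose (g1, g2) pair also occurs in newdf
touched = pd.MultiIndex.from_frame(combined[keys]).isin(
    pd.MultiIndex.from_frame(newdf[keys])
)

# Recompute the lag only inside the touched groups
lagged = combined[touched].groupby(keys)['x'].shift(1)

# Write the result back for the new rows only; old 'y' values stay as they were
new_idx = combined.index[n_old:]
combined.loc[new_idx, 'y'] = lagged.loc[new_idx]
print(combined)
```

Here the groups ('c', 'a') and ('c', 'b') are never passed to the groupby, so the saving grows with the share of groups that a daily batch leaves untouched. For a transform that depends on more than the previous row, the same selection step should still apply, with the transform run over the touched groups in place of `shift`.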
