Pandas: add row to each group depending on condition

Let's say I have a DataFrame like this:

         date  id  val
0  2017-01-01   1   10
1  2019-01-01   1   20
2  2017-01-01   2   50

I want to group this dataset by id.
For each group, I want to add a new row whose date is 1 year from now. This row should only be added if that date is later than the last date in the group. The row's val should be the same as that of the last row in the group.

The final table should look like this:

         date  id  val
0  2017-01-01   1   10
1  2019-01-01   1   20
2  2017-01-01   2   50
3  2018-09-25   2   50   <-- new row

The current code is below (the date column is named d here). I can get a mask showing which groups need a row appended, but I'm not sure what to do next.

>>> import datetime
>>> import pandas as pd
>>> df = pd.DataFrame(data={'d': [datetime.date(2017, 1, 1), datetime.date(2019, 1, 1), datetime.date(2017, 1, 1)], 'id': [1, 1, 2], 'val': [10, 20, 50]})
>>> df = df.sort_values(by='d')
>>> future_date = (pd.Timestamp.now() + pd.DateOffset(years=1)).date()  # pd.datetime was removed from pandas
>>> maxd = df.groupby('id')['d'].max()
>>> maxd < future_date
id
1    False
2     True
Name: d, dtype: bool

Here's one way:

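In [3481]: def add_row(x):
      ...:     # midnight today plus one year (normalize() drops the time of day)
      ...:     next_year = pd.to_datetime('today').normalize() + pd.DateOffset(years=1)
      ...:     if x['date'].max() < next_year:
      ...:         # copy the last row as a one-row DataFrame so x is not modified
      ...:         last_row = x.iloc[[-1]].copy()
      ...:         last_row['date'] = next_year
      ...:         # DataFrame.append was removed in pandas 2.0; use pd.concat
      ...:         return pd.concat([x, last_row])
      ...:     return x
      ...: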

In [3482]: df.groupby('id').apply(add_row).reset_index(drop=True)
Out[3482]:
        date  id  val
0 2017-01-01   1   10
1 2019-01-01   1   20
2 2017-01-01   2   50
3 2018-09-25   2   50
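
The output above depends on the day the code is run. As a minimal, reproducible sketch of the same idea, the version below takes the reference date as a parameter and loops over the groups instead of using groupby().apply(). The function name add_future_rows and the today parameter are illustrative assumptions, and the date 2017-09-25 is only inferred from the 2018-09-25 row shown above.

```python
import pandas as pd

def add_future_rows(df, today):
    """Per id, append a copy of the last row dated one year after `today`,
    but only when that date is later than the group's latest date.
    (Hypothetical helper, not part of pandas.)"""
    next_year = pd.Timestamp(today).normalize() + pd.DateOffset(years=1)
    pieces = []
    # Sort first so that the last row of each group is the latest one
    for _, group in df.sort_values(['id', 'date']).groupby('id'):
        pieces.append(group)
        if group['date'].max() < next_year:
            new_row = group.iloc[[-1]].copy()  # one-row DataFrame
            new_row['date'] = next_year
            pieces.append(new_row)
    return pd.concat(pieces, ignore_index=True)

df = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01', '2019-01-01', '2017-01-01']),
    'id': [1, 1, 2],
    'val': [10, 20, 50],
})

# Assumed reference date, so the result does not change from day to day
result = add_future_rows(df, today='2017-09-25')
print(result)
```

With that reference date, only id 2 gets a new row, because id 1 already has a date (2019-01-01) later than 2018-09-25.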

You can use idxmax with loc to select the rows with the max date:

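# assumes df['d'] is datetime64, e.g. after df['d'] = pd.to_datetime(df['d'])
future_date = pd.to_datetime('today').normalize() + pd.DateOffset(years=1)
maxd = df.loc[df.groupby('id')['d'].idxmax()]

# .copy() avoids a SettingWithCopyWarning on the assignment below
maxd = maxd[maxd['d'] < future_date].copy()
maxd['d'] = future_date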
print (maxd)
           d  id  val
2 2018-09-25   2   50

df = pd.concat([df, maxd]).sort_values(['id','d']).reset_index(drop=True)
print (df)
           d  id  val
0 2017-01-01   1   10
1 2019-01-01   1   20
2 2017-01-01   2   50
3 2018-09-25   2   50
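
The question's frame stores datetime.date objects in d, while this approach compares against a Timestamp, so the column probably needs converting first. The sketch below shows the idxmax approach end to end under that assumption. The fixed date 2017-09-25 is an assumed stand-in for "today" (inferred from the printed 2018-09-25), and the names latest, to_add and out are illustrative.

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    'd': [datetime.date(2017, 1, 1), datetime.date(2019, 1, 1),
          datetime.date(2017, 1, 1)],
    'id': [1, 1, 2],
    'val': [10, 20, 50],
})
df['d'] = pd.to_datetime(df['d'])  # object column of dates -> datetime64

# Assumed reference date instead of the real "today", for reproducibility
future_date = pd.Timestamp('2017-09-25') + pd.DateOffset(years=1)

# Row with the latest date in each group
latest = df.loc[df.groupby('id')['d'].idxmax()]
# Keep only groups whose latest date is before future_date, then re-date them
to_add = latest[latest['d'] < future_date].assign(d=future_date)

out = pd.concat([df, to_add]).sort_values(['id', 'd']).reset_index(drop=True)
print(out)
```

Using assign() returns a new frame, which sidesteps the chained-assignment warning without an explicit copy.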

A different way to look at it: use duplicated to find the last row per 'id'.

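# last row per id: the rows that have no later row with the same id
t = df[~df.duplicated('id', keep='last')]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat(
    [df,
     t.assign(
         date=pd.to_datetime('today').normalize() + pd.DateOffset(years=1)
     ).pipe(lambda d: d[d.date > t.date])],
    ignore_index=True).sort_values(['id', 'date'])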

        date  id  val
0 2017-01-01   1   10
1 2019-01-01   1   20
2 2017-01-01   2   50
3 2018-09-24   2   50
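
To make the duplicated step easier to follow, here is a small sketch of what the mask looks like on the sample data. The variable names has_later and last_rows are illustrative. Note that keep='last' picks the last row in row order, not the row with the latest date, so this presumably relies on the frame being sorted by date within each id.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01', '2019-01-01', '2017-01-01']),
    'id': [1, 1, 2],
    'val': [10, 20, 50],
})

# True for every row that is followed by another row with the same id
has_later = df.duplicated('id', keep='last')
print(has_later.tolist())  # [True, False, False]

# Negating the mask leaves exactly one row per id: the last one
last_rows = df[~has_later]
print(last_rows)
```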
