Pandas Dataframe Create Column of Next Future Date from Unique values of two other columns, with Groupby

I have a dataframe which can be created with this:

import pandas as pd
import datetime

#create df
data={'id':[1,1,1,1,2,2,2,2],
      'date1':[datetime.date(2016,1,1),datetime.date(2016,7,23),datetime.date(2017,2,26),datetime.date(2017,5,28),
               datetime.date(2015,11,1),datetime.date(2016,7,23),datetime.date(2017,6,28),datetime.date(2017,5,23)],
      'date2':[datetime.date(2017,5,12),datetime.date(2016,8,10),datetime.date(2017,10,26),datetime.date(2017,9,22),
               datetime.date(2015,11,9),datetime.date(2016,9,23),datetime.date(2017,8,3),datetime.date(2017,9,22)]}
df=pd.DataFrame.from_dict(data)
df=df[['id','date1','date2']]

And looks like this:

df
Out[83]: 
   id       date1       date2
0   1  2016-01-01  2017-05-12
1   1  2016-07-23  2016-08-10
2   1  2017-02-26  2017-10-26
3   1  2017-05-28  2017-09-22
4   2  2015-11-01  2015-11-09
5   2  2016-07-23  2016-09-23
6   2  2017-06-28  2017-08-03
7   2  2017-05-23  2017-09-22

What I need to do is create a new column called 'newdate'. Within each 'id' group, it should collect all the unique dates found in columns date1 and date2, and return the NEXT FUTURE date from those unique values, i.e. the earliest one that falls after the row's date2.

So the new dataframe would look like:

df
Out[87]: 
   id       date1       date2     newdate
0   1  2016-01-01  2017-05-12  2017-05-28
1   1  2016-07-23  2016-08-10  2017-02-26
2   1  2017-02-26  2017-10-26        None
3   1  2017-05-28  2017-09-22  2017-10-26
4   2  2015-11-01  2015-11-09  2016-07-23
5   2  2016-07-23  2016-09-23  2017-05-23
6   2  2017-06-28  2017-08-03  2017-09-22
7   2  2017-05-23  2017-09-22        None

For clarification, take a look at the id=2 records. Note that in row 4, the newdate is 2016-07-23. This is because it is the FIRST date, among all the dates present for id=2 in columns date1 and date2, that FOLLOWS the row 4 date2 (2015-11-09).

We definitely need to use groupby. I think we could use some form(s) of unique(), np.unique, pd.unique to get the dates? But then how do you select the 'NEXT' one and assign? Just stumped...

A few other points: don't assume the dataframe is sorted in any way, and efficiency is important here because the actual dataframe is very large. Note also that the 'None' values in newdate are there because no 'NEXT' future date exists for those rows: their date2 is already the maximum date in the group. We can use None, NaN, or whatever to represent these...

EDIT: Wen's answer fails when the same date appears more than once within a group. If you use this dataset:

data={'id':[1,1,1,1,2,2,2,2],
      'date1':[datetime.date(2016,1,1),datetime.date(2016,7,23),datetime.date(2017,2,26),datetime.date(2017,5,28),
               datetime.date(2015,11,1),datetime.date(2016,7,23),datetime.date(2017,6,28),datetime.date(2017,5,23)],
      'date2':[datetime.date(2017,5,12),datetime.date(2017,5,12),datetime.date(2017,2,26),datetime.date(2017,9,22),
               datetime.date(2015,11,9),datetime.date(2016,9,23),datetime.date(2017,8,3),datetime.date(2017,9,22)]}
df=pd.DataFrame.from_dict(data)
df=df[['id','date1','date2']]

Then the result is:

df
Out[104]: 
   id       date1       date2     newdate
0   1  2016-01-01  2017-05-12  2017-05-12
1   1  2016-07-23  2017-05-12  2017-05-28
2   1  2017-02-26  2017-02-26  2017-05-12
3   1  2017-05-28  2017-09-22         NaN
4   2  2015-11-01  2015-11-09  2016-07-23
5   2  2016-07-23  2016-09-23  2017-05-23
6   2  2017-06-28  2017-08-03  2017-09-22
7   2  2017-05-23  2017-09-22         NaN

Note that row 0 'newdate' should be 2017-05-28, the 'next' available date from the superset of date1&date2 for id==1.

I believe melt gets us closer though...

This is perhaps not the quickest approach, depending on your actual dataframe ("very large" could mean anything). It takes two steps: first, create a lookup table that maps every date to the next date within the same id; then merge that lookup with the original table.

#get the latest date for each row - just the max of date1 and date2
df['latest_date'] = df.loc[:, ['date1','date2']].max(axis=1)

#for each date, find the next date - basically create a lookup table
new_date_lookup = (df
                   .melt(id_vars=['id'], value_vars=['date1', 'date2'])
                   .loc[:, ['id','value']]
                  )

new_date_lookup = (new_date_lookup
                   .merge(new_date_lookup, on="id")
                   .query("value_y > value_x")
                   .groupby(["id", "value_x"])
                   .min()
                   .reset_index()
                   .rename(columns={'value_x': 'value', 'value_y':'new_date'})
                  )

#merge the original and lookup table together to get the new_date for each row
new_df = (pd
          .merge(df, new_date_lookup, how='left', left_on=['id', 'latest_date'], right_on=['id','value'])
          .drop(['latest_date', 'value'], axis=1)
         )

print(new_df)

Which gives the output:

   id       date1       date2    new_date
0   1  2016-01-01  2017-05-12  2017-05-28
1   1  2016-07-23  2016-08-10  2017-02-26
2   1  2017-02-26  2017-10-26         NaN
3   1  2017-05-28  2017-09-22  2017-10-26
4   2  2015-11-01  2015-11-09  2016-07-23
5   2  2016-07-23  2016-09-23  2017-05-23
6   2  2017-06-28  2017-08-03  2017-09-22
7   2  2017-05-23  2017-09-22         NaN

And for the second example, added in the edit, it gives the output:

   id       date1       date2    new_date
0   1  2016-01-01  2017-05-12  2017-05-28
1   1  2016-07-23  2017-05-12  2017-05-28
2   1  2017-02-26  2017-02-26  2017-05-12
3   1  2017-05-28  2017-09-22         NaN
4   2  2015-11-01  2015-11-09  2016-07-23
5   2  2016-07-23  2016-09-23  2017-05-23
6   2  2017-06-28  2017-08-03  2017-09-22
7   2  2017-05-23  2017-09-22         NaN
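
The self-merge in the lookup step pairs every date with every other date of the same id, so its intermediate table grows roughly quadratically with the number of rows per id. If that becomes a problem, one possible alternative is pd.merge_asof with direction='forward' and allow_exact_matches=False, which picks the first strictly later date directly. The following is only a sketch under some assumptions: the helper name add_newdate and the temporary _row column are made up for illustration, the dates are converted to datetime64 (merge_asof will not work on datetime.date objects), and the lookup is taken against date2 as described in the question rather than against the row-wise maximum used above.

```python
import datetime
import pandas as pd

def add_newdate(df):
    """Hypothetical helper: return a copy of df with a 'newdate' column holding
    the earliest date in the same id group (taken from date1 or date2) that is
    strictly later than the row's date2, or NaT if there is none."""
    out = df.copy()
    # merge_asof needs datetime64 keys, so convert the datetime.date objects
    out["date1"] = pd.to_datetime(out["date1"])
    out["date2"] = pd.to_datetime(out["date2"])

    # Long table of every distinct (id, date) pair, sorted by date
    lookup = (
        out.melt(id_vars="id", value_vars=["date1", "date2"], value_name="newdate")
        [["id", "newdate"]]
        .drop_duplicates()
        .sort_values("newdate")
    )

    # Temporary column (an assumption of this sketch) to restore row order later
    out["_row"] = range(len(out))

    # For each row, take the first lookup date in the same id that is > date2
    merged = pd.merge_asof(
        out.sort_values("date2"),
        lookup,
        left_on="date2",
        right_on="newdate",
        by="id",
        direction="forward",
        allow_exact_matches=False,
    )

    # Put the rows back in their original order and restore the original index
    merged = merged.sort_values("_row").drop(columns="_row")
    merged.index = df.index
    return merged

data = {'id': [1, 1, 1, 1, 2, 2, 2, 2],
        'date1': [datetime.date(2016, 1, 1), datetime.date(2016, 7, 23),
                  datetime.date(2017, 2, 26), datetime.date(2017, 5, 28),
                  datetime.date(2015, 11, 1), datetime.date(2016, 7, 23),
                  datetime.date(2017, 6, 28), datetime.date(2017, 5, 23)],
        'date2': [datetime.date(2017, 5, 12), datetime.date(2016, 8, 10),
                  datetime.date(2017, 10, 26), datetime.date(2017, 9, 22),
                  datetime.date(2015, 11, 9), datetime.date(2016, 9, 23),
                  datetime.date(2017, 8, 3), datetime.date(2017, 9, 22)]}
df = pd.DataFrame(data)[['id', 'date1', 'date2']]

result = add_newdate(df)
print(result)
```

Because the lookup holds each (id, date) pair only once, repeated dates within a group do not affect which date is picked. Rows with no later date come back as NaT instead of None or NaN.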
