
Pandas Groupby with Agg Min/Max date

I am trying to get the min/max dates from columns that might not have a date.

I have tried things like '', "NaN", True, False, 0, etc...

I first set the 'p_date' and 's_date' from the actual date based on the action of '+' or '-'.

df_td['p_date'] = np.where(df_td.action == '+', df_td.date, ???)
df_td['s_date'] = np.where(df_td.action == '-', df_td.date, ???)

So I need the min of the p_date & max of the s_date here.

  issue      p_date      s_date
0  issue  2012-11-01            
1  issue  2013-12-09            
2  issue  2014-12-08            
3  issue              2016-01-13
4  issue  2012-11-01            
5  issue              2014-03-26
6  issue              2015-05-29
7  issue  2013-12-18            
8  issue              2016-01-13

I do a group by

g = df_td.groupby(['name', 'type'], as_index=False).agg(
    {  ...
       'p_date': 'min',
       's_date': 'max'
    })

This gives me the following error.

'<=' not supported between instances of 'datetime.date' and 'str'

If I make up a date for the blank spaces it does work, but this is not a great option.

What can I fill the date with to get it to work?

Thanks.

In pandas, NaN is the standard missing value, and it is ignored by most operations, so it's the right thing to use here. If you're still getting an error, it's probably because your columns contain datetime.date objects (well, you definitely have those; I mean they're probably what's causing the problem).

For example, if your missing values are "" and your column dtypes are object with internal types of datetime.date, I get:

In [496]: df.groupby("issue").agg({"p_date": "min", "s_date": "max"})
[...]
TypeError: '<=' not supported between instances of 'datetime.date' and 'str'

but if I switch to pandas-native time objects and NaNs, it works:

In [500]: df["p_date"] = pd.to_datetime(df["p_date"])

In [501]: df["s_date"] = pd.to_datetime(df["s_date"])

In [502]: df
Out[502]: 
   issue     p_date     s_date
0  issue 2012-11-01        NaT
1  issue 2013-12-09        NaT
2  issue 2014-12-08        NaT
3  issue        NaT 2016-01-13
4  issue 2012-11-01        NaT
5  issue        NaT 2014-03-26
6  issue        NaT 2015-05-29
7  issue 2013-12-18        NaT
8  issue        NaT 2016-01-13

In [503]: df.groupby("issue").agg({"p_date": "min", "s_date": "max"})
Out[503]: 
          p_date     s_date
issue                      
issue 2012-11-01 2016-01-13
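Going back to the original np.where question: once the date column is datetime64, you don't even need np.where, since Series.where fills the non-matching rows with NaT automatically. A minimal sketch, using the question's df_td column names but made-up sample data:

```python
import pandas as pd

# Hypothetical frame mirroring the question's df_td: the action column
# decides whether the date belongs in p_date or s_date.
df_td = pd.DataFrame({
    "action": ["+", "-", "+"],
    "date": pd.to_datetime(["2012-11-01", "2016-01-13", "2013-12-09"]),
})

# Series.where keeps the value where the condition holds and fills the
# rest with NaT (the datetime missing value), preserving datetime64 dtype.
df_td["p_date"] = df_td["date"].where(df_td["action"] == "+")
df_td["s_date"] = df_td["date"].where(df_td["action"] == "-")
```

Because the columns stay datetime64 and the gaps are NaT, the min/max aggregations skip the missing rows without any sentinel dates.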

I am not happy with my answer, but it seems to work.

I created floor and ceiling dates:

import datetime

floor = datetime.date(1900, 1, 1)
ceil = datetime.date(2100, 1, 1)

df_td['p_date'] = np.where(df_td.action == '+', df_td.date, ceil)
df_td['s_date'] = np.where(df_td.action == '-', df_td.date, floor)

That way the fill values never win the min/max aggregations, which need comparable date objects to work.
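One caveat with this workaround: a group with no real dates at all will report the sentinel itself as its min/max. A small sketch (with a hypothetical aggregated frame g) of masking leftover sentinels back to NaT afterwards:

```python
import datetime
import pandas as pd

floor = datetime.date(1900, 1, 1)
ceil = datetime.date(2100, 1, 1)

# Hypothetical aggregated result: the second group had no '+' or '-'
# rows, so its min/max are still the raw sentinels.
g = pd.DataFrame({
    "p_date": [datetime.date(2012, 11, 1), ceil],
    "s_date": [datetime.date(2016, 1, 13), floor],
})

# Replace surviving sentinels with NaT so they don't look like real dates.
g["p_date"] = g["p_date"].mask(g["p_date"] == ceil, pd.NaT)
g["s_date"] = g["s_date"].mask(g["s_date"] == floor, pd.NaT)
```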

I would still like a more appropriate suggestion. Thanks.
