
Pandas timestamp difference in groupby transform

I have a dataframe with an integer index and three columns, session_id, event, and time_stamp, that looks like this:

In [41]: df = pd.DataFrame(data={'session_id': np.sort(np.random.choice(np.arange(3), 11)),
    ...:                         'event': np.random.choice(['A', 'B', 'C', 'D'], 11),
    ...:                         'time_stamp': pd.date_range('1/1/2017', periods=11, freq='S')
    ...:                         }).reset_index(drop=True)

In [42]: df
Out[42]:
   event  session_id          time_stamp
0      B           0 2017-01-01 00:00:00
1      C           0 2017-01-01 00:00:01
2      D           0 2017-01-01 00:00:02
3      B           1 2017-01-01 00:00:03
4      B           1 2017-01-01 00:00:04
5      D           2 2017-01-01 00:00:05
6      B           2 2017-01-01 00:00:06
7      A           2 2017-01-01 00:00:07
8      B           2 2017-01-01 00:00:08
9      B           2 2017-01-01 00:00:09
10     A           2 2017-01-01 00:00:10

I want to calculate session length using groupby and a lambda function, but I want the result to be a Series indexed the same as the original dataframe so I can add it as a column. This should be possible with groupby.transform, like this, but it raises a strange "Could not convert object to NumPy datetime" error:

In [44]: df.groupby('session_id')['time_stamp'].transform(lambda x: x.max() - x.min())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-44-c67ed1d4a90e> in <module>()
----> 1 df.groupby('session_id')['time_stamp'].transform(lambda x: x.max() - x.min())

/Users/hendele/anaconda2/lib/python2.7/site-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
   2843
   2844             indexer = self._get_index(name)
-> 2845             result[indexer] = res
   2846
   2847         result = _possibly_downcast_to_dtype(result, dtype)

ValueError: Could not convert object to NumPy datetime

I thought I was using it incorrectly, but the same lambda works fine with groupby.agg:

In [43]: df.groupby('session_id')['time_stamp'].agg(lambda x: x.max() - x.min())
Out[43]:
session_id
0   00:00:02
1   00:00:01
2   00:00:05
Name: time_stamp, dtype: timedelta64[ns]

Could you please explain whether this is a bug, and if not, what I'm doing wrong? Thanks!

P.S. I didn't want to use a timestamp index because I may have duplicate timestamps in the actual data.

Why does agg work but transform fail?

The difference between these two behaviors is that transform() needs to return a like-indexed Series. To facilitate this, transform starts with a copy of the original series, and then, after the computation for each group, sets the appropriate elements of that copy equal to the group's result. At that point it does a type comparison and discovers that the timedelta is not castable to a datetime. agg() does not perform this step, so it does not fail the type check.
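A quick way to see that it is the dtype of the returned value that trips this check, not transform itself, is to run a lambda whose result is already a datetime, such as the per-group maximum; that transform completes without error:

# works: the lambda returns a Timestamp, the same dtype as the column,
# so writing it back into the pre-allocated datetime64 result succeeds
df.groupby('session_id')['time_stamp'].transform(lambda x: x.max())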

A Workaround:

This analysis suggests a workaround: if the result of the transform is a datetime, it will succeed. So, to work around the problem:

base_time = df['time_stamp'][0]
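# adding base_time turns each group's timedelta into a datetime, which matches
# the dtype of the pre-allocated result; subtracting it afterwards recovers the timedelta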
df.groupby('session_id')['time_stamp'].transform(
    lambda x: x.max() - x.min() + base_time) - base_time
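Because transform returns a Series aligned with the original index, the result can be attached directly to the frame, as you wanted; the column name session_length below is just an illustrative choice:

# attach the per-session duration as a new column (the name is arbitrary)
df['session_length'] = (
    df.groupby('session_id')['time_stamp'].transform(
        lambda x: x.max() - x.min() + base_time) - base_time
)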

Is this a Bug?

I assume it is a bug, and I plan to file an issue in the morning. I will update here with the issue link.

Update:

I have submitted a bug and a pull request for this issue.
