I have a pandas DataFrame that is generated by this snippet:
import pandas as pd

elig = pd.DataFrame({'memberid': [1, 1, 1, 1, 1, 1, 2],
                     'monthid': [201711, 201712, 201801, 201805, 201806, 201807, 201810]})
I would like to perform a .groupby operation on memberid based on continuous values of monthid; that is, I would like the (very) end result to be a table looking like this:
memberid | start_month | end_month
1 | 201711 | 201801
1 | 201805 | 201807
2 | 201810 | 201810
I was wondering whether there is an idiomatic pandas way to do this. So far I have tried a convoluted method, defining new_elig = defaultdict(list) and then an outside function:
def f(x):
    global new_elig
    new_elig[x.iloc[0]['memberid']].append(x.iloc[0]['monthid'])
and finally
elig.groupby('memberid')[['memberid', 'monthid']].apply(f)
which takes about 5 minutes for ~700k rows in the original DataFrame to create new_elig, which I then have to inspect manually for each memberid to get the continuous ranges.
Is there a better way? There has to be one :/
Here is one method that I hope is fast enough for your needs. It involves some manual arithmetic on years and months. That feels dirty, but I think it is faster than converting the monthid column to a datetime Series with pd.to_datetime(elig['monthid'], format='%Y%m'), etc. It assumes the rows are already sorted by memberid and then monthid, as they are in your example.
# Get the four-digit year with floor division
elig['year'] = elig['monthid']//100
# Get the month as what is left after removing the year part
elig['month'] = elig['monthid'] - elig['year']*100
# Boolean mask 1:
# If current row minus previous row is NOT 1 month, flag the row with True.
# Boolean mask 2:
# If months are contiguous (thus slipping past mask 1)
# but memberid changes, flag the row with True.
# (This does not occur in your example data.)
mask1 = (elig['year']*12 + elig['month']).diff() != 1
mask2 = elig['memberid'] != elig['memberid'].shift()
# Convert the flag column to integer and take the cumulative sum.
# This converts the boolean flags into a column that assigns a
# unique integer to each contiguous run of consecutive months belonging
# to the same memberid.
elig['run_id'] = (mask1 | mask2).astype(int).cumsum()
res = (
    elig.groupby('run_id')
        .agg({'memberid': 'first', 'monthid': ['first', 'last']})
        .reset_index(drop=True)
)
res.columns = ['memberid', 'start_month', 'end_month']
res
   memberid  start_month  end_month
0         1       201711     201801
1         1       201805     201807
2         2       201810     201810
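If you want this as a reusable piece that does not add helper columns to your DataFrame, here is a minimal sketch of the same diff-and-cumsum idea wrapped in a function. The function name contiguous_month_ranges is my own invention, not a pandas API, and the sketch assumes monthid always holds valid YYYYMM integers. It sorts the rows first, so it should also work when the input is not already ordered by memberid and monthid.

```python
import pandas as pd


def contiguous_month_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse (memberid, monthid) rows into runs of consecutive months.

    Hypothetical helper for illustration; assumes monthid is a YYYYMM integer.
    """
    # Sort so that each member's months are adjacent and ascending.
    df = df.sort_values(['memberid', 'monthid']).reset_index(drop=True)

    # Map YYYYMM to a running month count so that 201712 -> 201801 is a step of 1.
    month_index = (df['monthid'] // 100) * 12 + df['monthid'] % 100

    # A new run starts where the month step is not 1 or where the member changes.
    new_run = (month_index.diff() != 1) | (df['memberid'] != df['memberid'].shift())
    run_id = new_run.cumsum().rename('run_id')

    return (
        df.groupby(run_id)
          .agg(memberid=('memberid', 'first'),
               start_month=('monthid', 'first'),
               end_month=('monthid', 'last'))
          .reset_index(drop=True)
    )


elig = pd.DataFrame({'memberid': [1, 1, 1, 1, 1, 1, 2],
                     'monthid': [201711, 201712, 201801, 201805, 201806, 201807, 201810]})

# Shuffle the rows to show that the result does not depend on input order.
shuffled = elig.sample(frac=1, random_state=0)
res = contiguous_month_ranges(shuffled)
print(res)
```

The named aggregation in .agg needs pandas 0.25 or later; on older versions the dict form used above, followed by renaming the columns, does the same job.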