
Pandas groupby the same column multiple times based on different column values

I have a pandas DataFrame that is generated by this snippet:

elig = pd.DataFrame({'memberid': [1,1,1,1,1,1,2],
                     'monthid': [201711, 201712, 201801, 201805, 201806, 201807, 201810]})

and I would like to perform a .groupby operation on memberid based on runs of consecutive monthid values. For example, I would like the final result to be a table looking like this:

memberid | start_month | end_month

    1    |    201711   |  201801
    1    |    201805   |  201807
    2    |    201810   |  201810

I was wondering whether there is an idiomatic Pandas way to do this. So far I have tried a convoluted method, defining a new_elig = defaultdict(list) and then an outside function:

def f(x):
    global new_elig
    new_elig[x.iloc[0]['memberid']].append(x.iloc[0]['monthid'])

and finally

elig.groupby('memberid')[['memberid', 'monthid']].apply(f)

This takes about 5 minutes for ~700k rows in the original DataFrame just to create new_elig, which I then have to inspect manually for each memberid to get the consecutive ranges.

Is there a better way? There has to be one :/

Here is one method that I hope is fast enough for your needs. It involves some manual arithmetic on years and months. That feels dirty, but I think it is faster than converting the monthid column to a datetime Series with pd.to_datetime(elig['monthid'], format='%Y%m').
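For comparison, the datetime route mentioned above might look roughly like the sketch below. It only shows that both approaches produce the same linear month counter; it makes no timing claim. The names elig_dt, month_index_dt and month_index_int are made up for this illustration, and the cast to str is a defensive assumption rather than something the original snippet does. The arithmetic version follows.

```python
import pandas as pd

elig_dt = pd.DataFrame({'memberid': [1, 1, 1, 1, 1, 1, 2],
                        'monthid': [201711, 201712, 201801, 201805, 201806, 201807, 201810]})

# Parse the YYYYMM values as month-start timestamps.
# (Casting to str first is an assumption made here to keep the format unambiguous.)
dates = pd.to_datetime(elig_dt['monthid'].astype(str), format='%Y%m')

# Linear month counter: consecutive calendar months differ by exactly 1,
# including the December -> January step.
month_index_dt = dates.dt.year * 12 + dates.dt.month

# The same counter computed with integer arithmetic only.
month_index_int = (elig_dt['monthid'] // 100) * 12 + elig_dt['monthid'] % 100
```

Both counters agree row by row, so the choice between them is a matter of speed and taste rather than correctness.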

# Split monthid (YYYYMM) into a four-digit year (floor division)
# and a month (the remainder).

elig['year'] = elig['monthid'] // 100
elig['month'] = elig['monthid'] - elig['year'] * 100


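# Both masks compare each row with the previous one, so the rows are
# assumed to be sorted by memberid and then monthid (as in your example).
# Boolean mask 1:
# If current row minus previous row is NOT 1 month, flag the row with True.
# Boolean mask 2:
# If months are contiguous (thus slipping past mask 1)
# but memberid changes, flag the row with True.
# (This does not occur in your example data.)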

mask1 = (elig['year']*12 + elig['month']).diff() != 1
mask2 = elig['memberid'] != elig['memberid'].shift()


# Convert the flag column to integer and take the cumulative sum.
# This converts the boolean flags into a column that assigns a 
# unique integer to each contiguous run of consecutive months belonging
# to the same memberid.

elig['run_id'] = (mask1 | mask2).astype(int).cumsum()

res = (
       elig.groupby('run_id')
           .agg({'memberid': 'first', 'monthid': ['first', 'last']})
           .reset_index(drop=True)
      )
res.columns = ['memberid', 'start_month', 'end_month']

res    
       memberid  start_month  end_month
    0         1       201711     201801
    1         1       201805     201807
    2         2       201810     201810
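The steps above can also be wrapped into a single function. The version below is a sketch under a few assumptions rather than a drop-in replacement: the function name contiguous_month_ranges is made up, it sorts the rows first (the masks only work on ordered data), it assumes there are no duplicate (memberid, monthid) rows, and it uses named aggregation, which needs pandas 0.25 or later.

```python
import pandas as pd


def contiguous_month_ranges(df):
    """Collapse (memberid, monthid) rows into runs of consecutive months."""
    # Sort first: the diff/shift comparisons below look at the previous row.
    df = df.sort_values(['memberid', 'monthid']).reset_index(drop=True)

    # Linear month counter, so December -> January is a step of 1.
    month_index = (df['monthid'] // 100) * 12 + df['monthid'] % 100

    # A new run starts where the month step is not 1 or the member changes.
    new_run = (month_index.diff() != 1) | (df['memberid'] != df['memberid'].shift())

    # Cumulative sum of the flags gives one integer label per run.
    run_id = new_run.cumsum()

    return (
        df.groupby(run_id)
          .agg(memberid=('memberid', 'first'),
               start_month=('monthid', 'first'),
               end_month=('monthid', 'last'))
          .reset_index(drop=True)
    )


# Rows deliberately shuffled to show that the sort step matters.
shuffled = pd.DataFrame({'memberid': [2, 1, 1, 1, 1, 1, 1],
                         'monthid': [201810, 201807, 201711, 201806, 201712, 201805, 201801]})
ranges = contiguous_month_ranges(shuffled)
```

Because the helper columns live in local variables, the input DataFrame is left without the extra year, month and run_id columns.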
