
Count rows with consecutive dates within PANDAS groupby

This is the closest to what I'm looking for that I've found

Let's say my dataframe looks something like this:

d = {'item_number':['K208UL','AKD098008','DF900A','K208UL','AKD098008'],
     'Comp_ID':['998798098','988797387','12398787','998798098','988797387'],
     'date':['2016-11-12','2016-11-13','2016-11-17','2016-11-13','2016-11-14']}

df = pd.DataFrame(data=d)

I would like to count the number of times the same item_number and Comp_ID were observed on consecutive days.

I imagine this will look something along the lines of:

g = df.groupby(['Comp_ID','item_number'])
g.apply(lambda x: x.loc[x.iloc[i,'date'].shift(-1) - x.iloc[i,'date'] == 1].count())

However, I would need to extract the day from each date as an int before comparing, which I'm also having trouble with.

for i in df.index:
    wbc_seven.iloc[i, 'day_column'] = datetime.datetime.strptime(df.iloc[i,'date'],'%Y-%m-%d').day

Apparently location based indexing only allows for integers? How could I solve this problem?
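As a side note (not part of the original question or answers), a vectorized way to get the day as an integer, without indexing row by row, is the .dt accessor. This is only a sketch using the sample frame above, and, as the answers below point out, extracting the day is not actually needed to count consecutive dates:

import pandas as pd

d = {'item_number': ['K208UL', 'AKD098008', 'DF900A', 'K208UL', 'AKD098008'],
     'Comp_ID': ['998798098', '988797387', '12398787', '998798098', '988797387'],
     'date': ['2016-11-12', '2016-11-13', '2016-11-17', '2016-11-13', '2016-11-14']}
df = pd.DataFrame(data=d)

# .iloc only accepts integer positions; .loc accepts labels such as 'day_column'.
# Converting the column once avoids the per-row strptime loop entirely.
df['date'] = pd.to_datetime(df['date'])
df['day_column'] = df['date'].dt.day   # day of month as an integer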

One solution would be to use pivot tables to count the number of times a Comp_ID and an item_number were observed on consecutive days.

import pandas as pd

d = {'item_number': ['K208UL', 'AKD098008', 'DF900A', 'K208UL', 'AKD098008'],
     'Comp_ID': ['998798098', '988797387', '12398787', '998798098', '988797387'],
     'date': ['2016-11-12', '2016-11-13', '2016-11-17', '2016-11-13', '2016-11-14']}

# sort so that rows for the same item/company pair sit next to each other, in date order
df = pd.DataFrame(data=d).sort_values(['item_number', 'Comp_ID', 'date'])
df['date'] = pd.to_datetime(df['date'])

# difference between each row's date and the previous row's date
df['delta'] = df['date'] - df['date'].shift(1)

# keep only rows exactly one day after the previous row of the same pair, then count per pair
df = df[(df['delta'] == pd.Timedelta(days=1)) &
        (df['Comp_ID'] == df['Comp_ID'].shift(1)) &
        (df['item_number'] == df['item_number'].shift(1))].pivot_table(
            index=['item_number', 'Comp_ID'], values=['date'],
            aggfunc='count').reset_index()
df.rename(columns={'date': 'consecutive_days'}, inplace=True)

Results in

  item_number    Comp_ID  consecutive_days
0   AKD098008  988797387                 1
1      K208UL  998798098                 1 

However, I would need to extract the day from each date as an int before comparing, which I'm also having trouble with.

Why?

To fix your code, you need:

df['date'] = pd.to_datetime(df['date'])   # convert the strings to Timestamps
g = df.groupby(['Comp_ID','item_number'])
g['date'].apply(lambda x: sum(abs(x.shift(-1) - x) == pd.to_timedelta(1, unit='D')))

Note the following:

  1. The code above avoids repetition. That is a basic programming principle: Don't Repeat Yourself.
  2. It converts 1 to a timedelta for proper comparison (see the short check after this list).
  3. It takes the absolute difference.
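
A quick check of point 2, as a small standalone sketch:

import pandas as pd

delta = pd.Timestamp('2016-11-13') - pd.Timestamp('2016-11-12')
print(delta == 1)                              # False: a Timedelta does not equal a plain int
print(delta == pd.to_timedelta(1, unit='D'))   # True: compare a Timedelta with a Timedelta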

Tip: write a top-level function for your work instead of a lambda, as it affords better readability, brevity, and aesthetics:

def differencer(grp, day_dif):
    """Counts rows in grp separated by day_dif day(s)"""
    d = abs(grp.shift(-1) - grp)
    return sum(d == pd.to_timedelta(day_dif, unit='D'))
g['date'].apply(differencer, day_dif=1)
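
Putting the pieces together, here is a self-contained sketch that reuses the differencer above with the sample data from the question (the expected counts noted in the comment follow directly from that data):

import pandas as pd

d = {'item_number': ['K208UL', 'AKD098008', 'DF900A', 'K208UL', 'AKD098008'],
     'Comp_ID': ['998798098', '988797387', '12398787', '998798098', '988797387'],
     'date': ['2016-11-12', '2016-11-13', '2016-11-17', '2016-11-13', '2016-11-14']}

df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'])

def differencer(grp, day_dif):
    """Counts rows in grp separated by day_dif day(s)"""
    d = abs(grp.shift(-1) - grp)
    return sum(d == pd.to_timedelta(day_dif, unit='D'))

counts = df.groupby(['Comp_ID', 'item_number'])['date'].apply(differencer, day_dif=1)
print(counts)   # 1 for each (Comp_ID, item_number) pair seen on consecutive days, 0 for DF900A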

Explanation:

It is pretty straightforward. The dates are converted to the Timestamp type, then subtracted. The difference is a timedelta, which needs to be compared with another timedelta object, hence the conversion of 1 (or day_dif) to a timedelta. The comparison yields a Boolean Series. Booleans are represented by 0 for False and 1 for True, so the sum of a Boolean Series is the total number of True values in the Series.
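
A minimal illustration of that last point, summing a Boolean Series:

import pandas as pd

mask = pd.Series([True, False, True, True])
print(mask.sum())   # 3: each True counts as 1, each False as 0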
