简体   繁体   中英

Date interval average Python pandas

This is my dataframe:

ID number Date purchase
1 2022-05-01
1 2021-03-03
1 2020-01-03
2 2019-01-03
2 2018-01-03

I want to get a horizontal dataframe with alle the dates in seperate columns per ID number.

So like this:

ID number Date 1 Date 2 Date 3
1 2022-05-01 2021-03-03 2020-01-03
2 2019-01-03 2018-01-03

After I did this I want to calculate the difference between these dates.

First step is GroupBy.cumcount with DataFrame.pivot :

df['Date purchase'] = pd.to_datetime(df['Date purchase'])
df1 = (df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
         .assign(g=lambda x: x.groupby('ID number').cumcount())
         .pivot('ID number','g','Date purchase')
         .rename(columns = lambda x: f'Date {x + 1}'))
print (df1)
g             Date 1     Date 2     Date 3
ID number                                 
1         2022-05-01 2021-03-03 2020-01-03
2         2019-01-03 2018-01-03        NaT

Then for differencies between columns use DataFrame.diff :

df2 = df1.diff(-1, axis=1)
print (df2)
g           Date 1   Date 2 Date 3
ID number                         
1         424 days 425 days    NaT
2         365 days      NaT    NaT

If need averages:

df3 = df1.apply(pd.Series.mean, axis=1).reset_index(name='Avg Dates').rename_axis(None, axis=1)
print (df3)
   ID number           Avg Dates
0          1 2021-03-02 16:00:00
1          2 2018-07-04 12:00:00

Could you do something like this?

def format_dataframe(df):
    """
    Function formats the dataframe to the following:
    | ID number| Date 1         | Date 2         | Date 3         |
    | -------- | -------------- | -------------- | -------------- |
    | 1        | 2022-05-01     | 2021-03-03     | 2020-01-03     |
    | 2        | 2019-01-03     | 2018-01-03     |                |
    """
    df = df.sort_values(by=['ID number', 'Date purchase'])
    df = df.drop_duplicates(subset=['ID number'], keep='first')
    df = df.drop(columns=['Date purchase'])
    df = df.rename(columns={'ID number': 'ID number', 'Date 1': 'Date 1', 'Date 2': 'Date 2', 'Date 3': 'Date 3'})
    return df

initial situation:

d = {'IdNumber': [1,1,1,2,2], 'Date': ['2022-05-01', '2021-03-03','2020-01-03','2019-01-03','2018-01-03']}
df = pd.DataFrame(data=d)

date conversion:

df['Date'] = pd.to_datetime(df['Date'])

creating new column:

df1=df.assign(Col=lambda x: x.groupby('IdNumber').cumcount())

pivoting:

df1=df1.pivot(index=["IdNumber"],columns=["Col"],values="Date")

reset index:

df1 = df1.reset_index(level=0)

rename column:

for i in range(1,len(df1.columns)):
    df1.columns.values[i]='Date{0}'.format(i)

final result:

Col IdNumber    Date1        Date2       Date3
0    1          2022-05-01  2021-03-03  2020-01-03
1    2          2019-01-03  2018-01-03  NaT

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM