简体   繁体   中英

Changing dataframe number of weeks between dates calculation

I have a dataframe that looks like this

from pandas import Timestamp
df = pd.DataFrame({'inventory_created_date': [Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00')],
                  'rma_processed_date': [Timestamp('2017-09-25 00:00:00'),
                                         Timestamp('2018-01-08 00:00:00'),
                                         Timestamp('2018-04-21 00:00:00'),
                                         Timestamp('2018-08-10 00:00:00'),
                                         Timestamp('2018-10-17 00:00:00'),
                                         Timestamp('2018-11-08 00:00:00'),
                                         Timestamp('2019-07-18 00:00:00'),
                                         Timestamp('2020-01-30 00:00:00'),
                                         Timestamp('2020-04-20 00:00:00'),
                                         Timestamp('2020-06-09 00:00:00')], 
                  'uniqueid':['9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959'],
                  'rma_created_date':[Timestamp('2017-07-31 00:00:00'),
                                     Timestamp('2017-12-12 00:00:00'),
                                     Timestamp('2018-04-03 00:00:00'),
                                     Timestamp('2018-07-23 00:00:00'),
                                     Timestamp('2018-09-28 00:00:00'),
                                     Timestamp('2018-10-24 00:00:00'),
                                     Timestamp('2019-06-21 00:00:00'),
                                     Timestamp('2019-12-03 00:00:00'),
                                     Timestamp('2020-04-03 00:00:00'),
                                     Timestamp('2020-05-18 00:00:00')],
                  'time_in_weeks':[50, 69, 85, 101, 110, 114, 148, 172, 189, 196],
                  'failure_status':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})

I need to adjust the time_in_weeks numbers for every row after the first. What I need to do is for each row after the first I need to take the rma_created_date and the date rma_processed_date above that row and find the number of weeks between them.

For example, in the second row we have rma_created_date of 2017-12-12 and we have 'rma_processed_date' of 2017-09-25 in the first row. Thus the number of weeks in between these two dates is 11 . There fore the 69 in the second row should become an 11 .

Lets for another example. On the third row we have rma_created_date of 2018-04-03 and an rma_processed_date in the second row of 2018-01-08 . Thus the number of weeks in between these two dates is 12 . Therefore the 85 in the third row should become an 12 .

This is what I have done so far

def clean_df(df):
    '''
    This function will fix the time_in_weeks column to calculate the correct number of weeks
    when there is multiple failured for an item.
    '''
    
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
    
    # Convert date columns into datetime
    df['inventory_created_date'] = pd.to_datetime(df['inventory_created_date'], errors='coerce')
    df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'], errors='coerce')
    df['rma_created_date'] = pd.to_datetime(df['rma_created_date'], errors='coerce')
    
    # If we have rma_processed_dates that are of 1/1/1900 then just drop that row
    df = df[~(df['rma_processed_date'] == '1900-01-01')]
    
    # Correct the time_in_weeks column
    df['time_in_weeks']=np.where(df.uniqueid.duplicated(keep='first'),df.rma_processed_date.dt.isocalendar().week.sub(df.rma_processed_date.dt.isocalendar().week.shift(1)),df.time_in_weeks)

    return df
df = clean_df(df)

When I apply this function to the example, this is what I get

df = pd.DataFrame({'inventory_created_date': [Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00')],
                  'rma_processed_date': [Timestamp('2017-09-25 00:00:00'),
                                         Timestamp('2018-01-08 00:00:00'),
                                         Timestamp('2018-04-21 00:00:00'),
                                         Timestamp('2018-08-10 00:00:00'),
                                         Timestamp('2018-10-17 00:00:00'),
                                         Timestamp('2018-11-08 00:00:00'),
                                         Timestamp('2019-07-18 00:00:00'),
                                         Timestamp('2020-01-30 00:00:00'),
                                         Timestamp('2020-04-20 00:00:00'),
                                         Timestamp('2020-06-09 00:00:00')], 
                  'uniqueid':['9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959'],
                  'rma_created_date':[Timestamp('2017-07-31 00:00:00'),
                                     Timestamp('2017-12-12 00:00:00'),
                                     Timestamp('2018-04-03 00:00:00'),
                                     Timestamp('2018-07-23 00:00:00'),
                                     Timestamp('2018-09-28 00:00:00'),
                                     Timestamp('2018-10-24 00:00:00'),
                                     Timestamp('2019-06-21 00:00:00'),
                                     Timestamp('2019-12-03 00:00:00'),
                                     Timestamp('2020-04-03 00:00:00'),
                                     Timestamp('2020-05-18 00:00:00')],
                  'time_in_weeks':[50, 4294967259, 14, 16, 10, 3, 4294967280, 4294967272, 12, 7],
                  'failure_status':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})

Obviously the calculation is incorrect, which leads me to believe there must be something wrong with this

df['time_in_weeks']=np.where(df.uniqueid.duplicated(keep='first'),df.rma_processed_date.dt.isocalendar().week.sub(df.rma_processed_date.dt.isocalendar().week.shift(1)),df.time_in_weeks)

If anyone has any suggestions I would greatly appreciate it.

The time_in_weeks column is expected to be [50, 11, 12, 13, 7, 1, 32, 20, 9, 4]

Let's shift the rma_processed_date then subtract it from rma_created_date finally get the days using .dt.days and divide by 7 to get number of weeks, finnaly use update to update the time_in_weeks column:

weeks = df['rma_created_date'].sub(df['rma_processed_date'].shift()).dt.days.div(7).round()
df['time_in_weeks'].update(weeks)

Result:

  inventory_created_date rma_processed_date    uniqueid rma_created_date  time_in_weeks  failure_status
0             2016-08-17         2017-09-25  9907937959       2017-07-31             50               1
1             2016-08-17         2018-01-08  9907937959       2017-12-12             11               1
2             2016-08-17         2018-04-21  9907937959       2018-04-03             12               1
3             2016-08-17         2018-08-10  9907937959       2018-07-23             13               1
4             2016-08-17         2018-10-17  9907937959       2018-09-28              7               1
5             2016-08-17         2018-11-08  9907937959       2018-10-24              1               1
6             2016-08-17         2019-07-18  9907937959       2019-06-21             32               1
7             2016-08-17         2020-01-30  9907937959       2019-12-03             20               1
8             2016-08-17         2020-04-20  9907937959       2020-04-03              9               1
9             2016-08-17         2020-06-09  9907937959       2020-05-18              4               1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM