
Changing dataframe number of weeks between dates calculation

I have a dataframe that looks like this:

import numpy as np
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({'inventory_created_date': [Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00')],
                  'rma_processed_date': [Timestamp('2017-09-25 00:00:00'),
                                         Timestamp('2018-01-08 00:00:00'),
                                         Timestamp('2018-04-21 00:00:00'),
                                         Timestamp('2018-08-10 00:00:00'),
                                         Timestamp('2018-10-17 00:00:00'),
                                         Timestamp('2018-11-08 00:00:00'),
                                         Timestamp('2019-07-18 00:00:00'),
                                         Timestamp('2020-01-30 00:00:00'),
                                         Timestamp('2020-04-20 00:00:00'),
                                         Timestamp('2020-06-09 00:00:00')], 
                  'uniqueid':['9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959'],
                  'rma_created_date':[Timestamp('2017-07-31 00:00:00'),
                                     Timestamp('2017-12-12 00:00:00'),
                                     Timestamp('2018-04-03 00:00:00'),
                                     Timestamp('2018-07-23 00:00:00'),
                                     Timestamp('2018-09-28 00:00:00'),
                                     Timestamp('2018-10-24 00:00:00'),
                                     Timestamp('2019-06-21 00:00:00'),
                                     Timestamp('2019-12-03 00:00:00'),
                                     Timestamp('2020-04-03 00:00:00'),
                                     Timestamp('2020-05-18 00:00:00')],
                  'time_in_weeks':[50, 69, 85, 101, 110, 114, 148, 172, 189, 196],
                  'failure_status':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})

I need to adjust the time_in_weeks numbers for every row after the first. For each of those rows, I need to take that row's rma_created_date and the rma_processed_date from the row above it, and find the number of weeks between them.

For example, the second row has an rma_created_date of 2017-12-12, and the first row has an rma_processed_date of 2017-09-25. The number of weeks between these two dates is 11, so the 69 in the second row should become 11.

As another example, the third row has an rma_created_date of 2018-04-03, and the second row has an rma_processed_date of 2018-01-08. The number of weeks between these two dates is 12, so the 85 in the third row should become 12.
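To make the arithmetic concrete, here is a minimal sketch of the second-row calculation using plain Timestamp subtraction. Rounding to the nearest whole week is an assumption about what "number of weeks" means here; the variable names are only for illustration.

```python
import pandas as pd

# rma_created_date of the second row
created = pd.Timestamp('2017-12-12')
# rma_processed_date of the first row (the row above)
prev_processed = pd.Timestamp('2017-09-25')

# Subtracting two Timestamps gives a Timedelta; .days is the whole-day count.
days = (created - prev_processed).days

# Assumption: a "week" is 7 days and the result is rounded to the nearest week.
weeks = round(days / 7)

print(days, weeks)  # 78 11
```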

This is what I have done so far:

def clean_df(df):
    '''
    This function will fix the time_in_weeks column to calculate the correct number of weeks
    when there are multiple failures for an item.
    '''
    
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
    
    # Convert date columns into datetime
    df['inventory_created_date'] = pd.to_datetime(df['inventory_created_date'], errors='coerce')
    df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'], errors='coerce')
    df['rma_created_date'] = pd.to_datetime(df['rma_created_date'], errors='coerce')
    
    # If we have rma_processed_dates that are of 1/1/1900 then just drop that row
    df = df[~(df['rma_processed_date'] == '1900-01-01')]
    
    # Correct the time_in_weeks column
    df['time_in_weeks']=np.where(df.uniqueid.duplicated(keep='first'),df.rma_processed_date.dt.isocalendar().week.sub(df.rma_processed_date.dt.isocalendar().week.shift(1)),df.time_in_weeks)

    return df
df = clean_df(df)

When I apply this function to the example, this is what I get:

df = pd.DataFrame({'inventory_created_date': [Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00'),
                                             Timestamp('2016-08-17 00:00:00')],
                  'rma_processed_date': [Timestamp('2017-09-25 00:00:00'),
                                         Timestamp('2018-01-08 00:00:00'),
                                         Timestamp('2018-04-21 00:00:00'),
                                         Timestamp('2018-08-10 00:00:00'),
                                         Timestamp('2018-10-17 00:00:00'),
                                         Timestamp('2018-11-08 00:00:00'),
                                         Timestamp('2019-07-18 00:00:00'),
                                         Timestamp('2020-01-30 00:00:00'),
                                         Timestamp('2020-04-20 00:00:00'),
                                         Timestamp('2020-06-09 00:00:00')], 
                  'uniqueid':['9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959',
                             '9907937959'],
                  'rma_created_date':[Timestamp('2017-07-31 00:00:00'),
                                     Timestamp('2017-12-12 00:00:00'),
                                     Timestamp('2018-04-03 00:00:00'),
                                     Timestamp('2018-07-23 00:00:00'),
                                     Timestamp('2018-09-28 00:00:00'),
                                     Timestamp('2018-10-24 00:00:00'),
                                     Timestamp('2019-06-21 00:00:00'),
                                     Timestamp('2019-12-03 00:00:00'),
                                     Timestamp('2020-04-03 00:00:00'),
                                     Timestamp('2020-05-18 00:00:00')],
                  'time_in_weeks':[50, 4294967259, 14, 16, 10, 3, 4294967280, 4294967272, 12, 7],
                  'failure_status':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})

Obviously the calculation is incorrect, which leads me to believe there must be something wrong with this line:

df['time_in_weeks']=np.where(df.uniqueid.duplicated(keep='first'),df.rma_processed_date.dt.isocalendar().week.sub(df.rma_processed_date.dt.isocalendar().week.shift(1)),df.time_in_weeks)

If anyone has any suggestions I would greatly appreciate it.

The time_in_weeks column is expected to be [50, 11, 12, 13, 7, 1, 32, 20, 9, 4].

Let's shift rma_processed_date down by one row and subtract it from rma_created_date, then get the days using .dt.days, divide by 7 and round to get the number of weeks. Finally, use update to write the result into the time_in_weeks column. Because update ignores missing values, the first row (which has no previous row) keeps its original 50:

weeks = df['rma_created_date'].sub(df['rma_processed_date'].shift()).dt.days.div(7).round()
df['time_in_weeks'].update(weeks)
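One caveat: calling update on a column selected from the dataframe relies on that selection sharing memory with df, which may not hold when pandas copy-on-write is enabled. A hedged alternative is to assign the column back explicitly. The sketch below also shifts within each uniqueid, on the assumption that the real data contains more than one item; the function name recompute_weeks and the small demo frame are hypothetical and not part of the question.

```python
import pandas as pd

def recompute_weeks(df):
    # Work on a sorted copy so the caller's dataframe is left untouched.
    out = df.sort_values(['uniqueid', 'rma_created_date']).copy()

    # Previous row's rma_processed_date, restarted for every uniqueid
    # (assumption: rows of different items must not be compared).
    prev_processed = out.groupby('uniqueid')['rma_processed_date'].shift()

    # Days between this row's creation and the previous processing,
    # converted to weeks and rounded to the nearest whole week.
    weeks = (out['rma_created_date'] - prev_processed).dt.days.div(7).round()

    # The first row of each item has no previous row (NaN): keep its old value.
    out['time_in_weeks'] = weeks.fillna(out['time_in_weeks']).astype('int64')
    return out

# Hypothetical demo: three rows taken from the question plus one made-up item 'B'.
demo = pd.DataFrame({
    'uniqueid': ['A', 'A', 'A', 'B'],
    'rma_created_date': pd.to_datetime(
        ['2017-07-31', '2017-12-12', '2018-04-03', '2018-01-01']),
    'rma_processed_date': pd.to_datetime(
        ['2017-09-25', '2018-01-08', '2018-04-21', '2018-02-01']),
    'time_in_weeks': [50, 69, 85, 30],
})
result = recompute_weeks(demo)
print(result['time_in_weeks'].tolist())  # [50, 11, 12, 30]
```

With a single uniqueid, as in the question's example, the grouped shift behaves the same as a plain shift().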

Result:

  inventory_created_date rma_processed_date    uniqueid rma_created_date  time_in_weeks  failure_status
0             2016-08-17         2017-09-25  9907937959       2017-07-31             50               1
1             2016-08-17         2018-01-08  9907937959       2017-12-12             11               1
2             2016-08-17         2018-04-21  9907937959       2018-04-03             12               1
3             2016-08-17         2018-08-10  9907937959       2018-07-23             13               1
4             2016-08-17         2018-10-17  9907937959       2018-09-28              7               1
5             2016-08-17         2018-11-08  9907937959       2018-10-24              1               1
6             2016-08-17         2019-07-18  9907937959       2019-06-21             32               1
7             2016-08-17         2020-01-30  9907937959       2019-12-03             20               1
8             2016-08-17         2020-04-20  9907937959       2020-04-03              9               1
9             2016-08-17         2020-06-09  9907937959       2020-05-18              4               1
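As for why the original attempt produced values like 4294967259: .dt.isocalendar().week returns the week of the year (1 to 53) as an unsigned 32-bit column, so it is not a count of elapsed weeks, and a negative difference across a year boundary is consistent with wrapping around modulo 2**32. The original line also compares rma_processed_date with itself rather than with rma_created_date. The sketch below reproduces the first bad value with plain Python integers; treating the wraparound as the cause is an inference from the numbers, not something the question confirms.

```python
import pandas as pd

first_processed = pd.Timestamp('2017-09-25')
second_processed = pd.Timestamp('2018-01-08')

# isocalendar() returns (ISO year, ISO week, ISO weekday); index 1 is the week.
week_a = first_processed.isocalendar()[1]   # 39
week_b = second_processed.isocalendar()[1]  # 2

# Week-of-year numbers restart every January, so the difference is negative.
naive_diff = week_b - week_a                # -37

# What -37 looks like when stored in an unsigned 32-bit integer.
wrapped = naive_diff % 2**32                # 4294967259

# Subtracting the dates themselves gives the elapsed time regardless of year.
elapsed_weeks = (second_processed - first_processed).days // 7  # 15

print(week_a, week_b, naive_diff, wrapped, elapsed_weeks)
```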
