Calculation of days and weeks from some dates in DataFrame in Python Pandas?
Changing dataframe number of weeks between dates calculation
I have a dataframe that looks like this:
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({'inventory_created_date': [Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00')],
'rma_processed_date': [Timestamp('2017-09-25 00:00:00'),
Timestamp('2018-01-08 00:00:00'),
Timestamp('2018-04-21 00:00:00'),
Timestamp('2018-08-10 00:00:00'),
Timestamp('2018-10-17 00:00:00'),
Timestamp('2018-11-08 00:00:00'),
Timestamp('2019-07-18 00:00:00'),
Timestamp('2020-01-30 00:00:00'),
Timestamp('2020-04-20 00:00:00'),
Timestamp('2020-06-09 00:00:00')],
'uniqueid':['9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959'],
'rma_created_date':[Timestamp('2017-07-31 00:00:00'),
Timestamp('2017-12-12 00:00:00'),
Timestamp('2018-04-03 00:00:00'),
Timestamp('2018-07-23 00:00:00'),
Timestamp('2018-09-28 00:00:00'),
Timestamp('2018-10-24 00:00:00'),
Timestamp('2019-06-21 00:00:00'),
Timestamp('2019-12-03 00:00:00'),
Timestamp('2020-04-03 00:00:00'),
Timestamp('2020-05-18 00:00:00')],
'time_in_weeks':[50, 69, 85, 101, 110, 114, 148, 172, 189, 196],
'failure_status':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})
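I need to adjust the time_in_weeks number for every row after the first. For each of those rows, I need to take that row's rma_created_date and the rma_processed_date from the row above it, and find the number of weeks between them.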
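For example, in the second row the rma_created_date is 2017-12-12, and the rma_processed_date in the first row is 2017-09-25. The number of weeks between these two dates is 11, so the 69 in the second row should become 11.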
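Let's take another example. In the third row the rma_created_date is 2018-04-03, and the rma_processed_date in the second row is 2018-01-08. The number of weeks between these two dates is 12, so the 85 in the third row should become 12.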
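The arithmetic in the two examples above can be checked with plain date subtraction. This is only a sketch: the helper name weeks_between is hypothetical, and rounding to the nearest week is an assumption (the question does not say whether to round or truncate; both give the same result for these two rows).

```python
from datetime import date


def weeks_between(earlier: date, later: date) -> int:
    """Number of weeks between two dates, rounded to the nearest week.

    Rounding (rather than flooring) is an assumption made for this sketch.
    """
    return round((later - earlier).days / 7)


# Row 2: previous rma_processed_date vs. this row's rma_created_date
row2 = weeks_between(date(2017, 9, 25), date(2017, 12, 12))  # 78 days apart

# Row 3: previous rma_processed_date vs. this row's rma_created_date
row3 = weeks_between(date(2018, 1, 8), date(2018, 4, 3))     # 85 days apart

print(row2, row3)
```

The two values agree with the 11 and 12 worked out by hand above.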
Here is what I have done so far:
import numpy as np

def clean_df(df):
    '''
    This function will fix the time_in_weeks column to calculate the correct number of weeks
    when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
    # Convert date columns into datetime
    df['inventory_created_date'] = pd.to_datetime(df['inventory_created_date'], errors='coerce')
    df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'], errors='coerce')
    df['rma_created_date'] = pd.to_datetime(df['rma_created_date'], errors='coerce')
    # If we have rma_processed_dates that are of 1/1/1900 then just drop that row
    df = df[~(df['rma_processed_date'] == '1900-01-01')]
    # Correct the time_in_weeks column
    df['time_in_weeks']=np.where(df.uniqueid.duplicated(keep='first'),df.rma_processed_date.dt.isocalendar().week.sub(df.rma_processed_date.dt.isocalendar().week.shift(1)),df.time_in_weeks)
    return df

df = clean_df(df)
When I apply this function to the example, this is what I get:
df = pd.DataFrame({'inventory_created_date': [Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00'),
Timestamp('2016-08-17 00:00:00')],
'rma_processed_date': [Timestamp('2017-09-25 00:00:00'),
Timestamp('2018-01-08 00:00:00'),
Timestamp('2018-04-21 00:00:00'),
Timestamp('2018-08-10 00:00:00'),
Timestamp('2018-10-17 00:00:00'),
Timestamp('2018-11-08 00:00:00'),
Timestamp('2019-07-18 00:00:00'),
Timestamp('2020-01-30 00:00:00'),
Timestamp('2020-04-20 00:00:00'),
Timestamp('2020-06-09 00:00:00')],
'uniqueid':['9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959',
'9907937959'],
'rma_created_date':[Timestamp('2017-07-31 00:00:00'),
Timestamp('2017-12-12 00:00:00'),
Timestamp('2018-04-03 00:00:00'),
Timestamp('2018-07-23 00:00:00'),
Timestamp('2018-09-28 00:00:00'),
Timestamp('2018-10-24 00:00:00'),
Timestamp('2019-06-21 00:00:00'),
Timestamp('2019-12-03 00:00:00'),
Timestamp('2020-04-03 00:00:00'),
Timestamp('2020-05-18 00:00:00')],
'time_in_weeks':[50, 4294967259, 14, 16, 10, 3, 4294967280, 4294967272, 12, 7],
'failure_status':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})
The calculation is clearly incorrect, which leads me to believe something must be wrong with this line:
df['time_in_weeks']=np.where(df.uniqueid.duplicated(keep='first'),df.rma_processed_date.dt.isocalendar().week.sub(df.rma_processed_date.dt.isocalendar().week.shift(1)),df.time_in_weeks)
If anyone has any suggestions, I would greatly appreciate it.
The time_in_weeks column is expected to be [50, 11, 12, 13, 7, 1, 32, 20, 9, 4].
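One likely reason for the huge numbers in the output: isocalendar().week is the ISO week of the year (1 to 53), not a running week count, so subtracting it across a year boundary goes negative. pandas documents the isocalendar() columns as unsigned 32-bit integers, and the values in the output are consistent with that negative difference wrapping around. The sketch below reproduces the second row using only the standard library; the modulo step stands in for the unsigned wraparound and is an assumption about what happens inside the original expression.

```python
from datetime import date

# rma_processed_date of the first and second rows in the question
first_processed = date(2017, 9, 25)
second_processed = date(2018, 1, 8)

# ISO week of the year, the same quantity that .dt.isocalendar().week exposes
week_first = first_processed.isocalendar()[1]
week_second = second_processed.isocalendar()[1]

# The difference is negative because the two dates fall in different years
signed_diff = week_second - week_first

# Emulate storing that negative difference in an unsigned 32-bit integer
wrapped = signed_diff % 2**32

print(week_first, week_second, signed_diff, wrapped)
```

The wrapped value matches the 4294967259 shown for the second row. Note also that the original expression subtracts rma_processed_date from the previous rma_processed_date, whereas the description asks for rma_created_date minus the previous rma_processed_date.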
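Let's shift rma_processed_date and subtract it from rma_created_date, then use .dt.days to get the number of days, divide by 7 and round to get the number of weeks, and finally use update to update the time_in_weeks column: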
weeks = df['rma_created_date'].sub(df['rma_processed_date'].shift()).dt.days.div(7).round()
df['time_in_weeks'].update(weeks)
Result:
  inventory_created_date rma_processed_date    uniqueid rma_created_date  time_in_weeks  failure_status
0             2016-08-17         2017-09-25  9907937959       2017-07-31             50               1
1             2016-08-17         2018-01-08  9907937959       2017-12-12             11               1
2             2016-08-17         2018-04-21  9907937959       2018-04-03             12               1
3             2016-08-17         2018-08-10  9907937959       2018-07-23             13               1
4             2016-08-17         2018-10-17  9907937959       2018-09-28              7               1
5             2016-08-17         2018-11-08  9907937959       2018-10-24              1               1
6             2016-08-17         2019-07-18  9907937959       2019-06-21             32               1
7             2016-08-17         2020-01-30  9907937959       2019-12-03             20               1
8             2016-08-17         2020-04-20  9907937959       2020-04-03              9               1
9             2016-08-17         2020-06-09  9907937959       2020-05-18              4               1
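The question's use of uniqueid.duplicated suggests the real data may hold several items, in which case a plain shift() would compare the first row of one item against the last row of another. A minimal sketch of a per-item variant follows, under some assumptions: the function name recompute_weeks and the second id 'hypothetical-id' with its dates are made up for illustration, and the result is assigned directly instead of calling update on df['time_in_weeks'], since that chained call may not modify df when pandas Copy-on-Write is enabled.

```python
import pandas as pd


def recompute_weeks(df: pd.DataFrame) -> pd.DataFrame:
    """Recompute time_in_weeks per uniqueid, keeping each item's first row as-is."""
    out = df.sort_values(['uniqueid', 'rma_created_date']).copy()

    # Previous row's rma_processed_date within the same uniqueid (NaT on first rows)
    prev_processed = out.groupby('uniqueid')['rma_processed_date'].shift()

    # Days between this row's creation and the previous processing, as rounded weeks
    weeks = (out['rma_created_date'] - prev_processed).dt.days.div(7).round()

    # First rows have no predecessor, so they keep their original value
    out['time_in_weeks'] = weeks.fillna(out['time_in_weeks']).astype(int)
    return out


# First three rows from the question plus a made-up second item
sample = pd.DataFrame({
    'uniqueid': ['9907937959'] * 3 + ['hypothetical-id'] * 2,
    'rma_created_date': pd.to_datetime(
        ['2017-07-31', '2017-12-12', '2018-04-03', '2018-01-01', '2018-03-05']),
    'rma_processed_date': pd.to_datetime(
        ['2017-09-25', '2018-01-08', '2018-04-21', '2018-01-15', '2018-03-20']),
    'time_in_weeks': [50, 69, 85, 30, 39],
})

result = recompute_weeks(sample)
print(result[['uniqueid', 'time_in_weeks']])
```

With a single uniqueid, as in the question's sample, this reduces to the same shift-and-subtract calculation as the answer above.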