How to use date_range with day frequency?
I tried to use date_range with a day frequency on this dataframe:
import pandas as pd

df = pd.DataFrame({'Start':['2022-06-07 06:24:48','2022-06-07 14:37:16','2022-06-07 08:00:59'],
                   'End':['2022-06-07 14:07:00','2022-06-08 02:51:21','2022-06-09 13:18:34'],
                   'Process':['PROD','VORG','STO'],
                   'Duration_Min':[462.20,734.08,3197.58]})
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
df['difference']=df['End'].dt.date-df['Start'].dt.date

def find_interval(sr):
    dti = pd.date_range(sr['Start'], sr['End'], freq='1D').normalize() + pd.Timedelta(days=1)
    return list(zip([sr['Start']] + dti.tolist(), dti.tolist() + [sr['End']]))

df1 = df.apply(find_interval, axis=1).explode().apply(pd.Series)
df1 = df.drop(columns=['Start', 'End']).join(df1).rename(columns={0: 'Start', 1: 'End'})
df1['Duration_Min']=(df1["End"]-df1["Start"]).dt.total_seconds().div(60)
What I get is:
  Process  Duration_Min difference               Start                 End
0    PROD   1055.200000     0 days 2022-06-07 06:24:48 2022-06-08 00:00:00
0    PROD   -593.000000     0 days 2022-06-08 00:00:00 2022-06-07 14:07:00
1    VORG    562.733333     1 days 2022-06-07 14:37:16 2022-06-08 00:00:00
1    VORG    171.350000     1 days 2022-06-08 00:00:00 2022-06-08 02:51:21
2     STO    959.016667     2 days 2022-06-07 08:00:59 2022-06-08 00:00:00
2     STO   1440.000000     2 days 2022-06-08 00:00:00 2022-06-09 00:00:00
2     STO   1440.000000     2 days 2022-06-09 00:00:00 2022-06-10 00:00:00
2     STO   -641.433333     2 days 2022-06-10 00:00:00 2022-06-09 13:18:34
I would like to split the events so that new timestamps with new intervals are created whenever the day changes between Start and End. If the difference between the dates is 0 days, I don't need to create new timestamps, and with the Timedelta(days=1) the End timestamp is mismatched. The Days column should correspond to weekday().
What I want:
                Start                 End Process  Duration_Min  Days
0 2022-06-07 06:24:48 2022-06-07 14:07:00    PROD    462.200000     1
1 2022-06-07 14:37:16 2022-06-07 23:59:59    VORG    562.716667     1
2 2022-06-08 00:00:00 2022-06-08 02:51:21    VORG    171.350000     2
3 2022-06-07 08:00:59 2022-06-07 23:59:59     STO    959.000000     1
4 2022-06-08 00:00:00 2022-06-08 23:59:59     STO   1439.983333     2
5 2022-06-09 00:00:00 2022-06-09 13:18:34     STO    798.566667     3
How could I achieve this?
You could try:
def find_interval(row):
    start, end = row.at["Start"], row.at["End"]
    # Midnights from the start day to the end day (inclusive)
    days = pd.date_range(start, end, freq="D", normalize=True).to_list()
    if len(days) == 1 or days[-1] != end:
        days.append(end)
    days[0] = start
    # Consecutive pairs are the (Start, End) pieces of the event
    return list(zip(days, days[1:]))

result = (
    df
    .assign(Days=df.apply(find_interval, axis=1))
    .explode("Days")
    .assign(
        Start=lambda df: df["Days"].str[0],
        End=lambda df: df["Days"].str[1],
        Duration_Min=lambda df:
            (df["End"] - df["Start"]).dt.total_seconds().div(60),
        Days=lambda df: df.groupby("Process").transform("cumcount") + 1
    )
)
Result for your df:
                Start                 End Process  Duration_Min  Days
0 2022-06-07 06:24:48 2022-06-07 14:07:00    PROD    462.200000     1
1 2022-06-07 14:37:16 2022-06-08 00:00:00    VORG    562.733333     1
1 2022-06-08 00:00:00 2022-06-08 02:51:21    VORG    171.350000     2
2 2022-06-07 08:00:59 2022-06-08 00:00:00     STO    959.016667     1
2 2022-06-08 00:00:00 2022-06-09 00:00:00     STO   1440.000000     2
2 2022-06-09 00:00:00 2022-06-09 13:18:34     STO    798.566667     3
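Note that this result ends each piece at midnight of the following day, whereas the question asks for pieces ending at 23:59:59 and a Days column that matches weekday(). A minimal sketch of that variant is below. The helper name split_by_day, the one-second offset, and the use of timezone-naive timestamps are assumptions made for illustration and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Start": pd.to_datetime(["2022-06-07 06:24:48", "2022-06-07 14:37:16", "2022-06-07 08:00:59"]),
    "End": pd.to_datetime(["2022-06-07 14:07:00", "2022-06-08 02:51:21", "2022-06-09 13:18:34"]),
    "Process": ["PROD", "VORG", "STO"],
})

def split_by_day(start, end):
    """Hypothetical helper: (piece_start, piece_end) pairs, one per calendar day."""
    # Midnights after `start` up to the day of `end`; empty for same-day events.
    cuts = pd.date_range(start.normalize() + pd.Timedelta(days=1), end.normalize(), freq="D")
    # Drop a cut that coincides with `end` to avoid a zero-length last piece.
    cuts = cuts[cuts < end]
    starts = [start, *cuts]
    # Every piece but the last ends one second before the next midnight.
    ends = [*(cuts - pd.Timedelta(seconds=1)), end]
    return list(zip(starts, ends))

records = [
    {"Start": s, "End": e, "Process": process}
    for start, end, process in zip(df["Start"], df["End"], df["Process"])
    for s, e in split_by_day(start, end)
]
result = pd.DataFrame(records)
result["Duration_Min"] = (result["End"] - result["Start"]).dt.total_seconds().div(60)
# Monday is 0, so Tuesday 2022-06-07 gives 1.
result["Days"] = result["Start"].dt.weekday
```

Because the pieces stop at 23:59:59, the durations of a split event add up to slightly less than the original duration (one second is lost per day boundary). If that matters, keep the midnight-to-midnight pieces from the answer above instead.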
If, as indicated in the comments, a substantial part of df doesn't need the day separation, then the following might be better:
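# True for rows whose Start and End fall on different days
m = df["Start"].dt.date < df["End"].dt.date
result = (
    df[m]
    # Only apply the splitting to the rows that need it
    .assign(Days=df[m].apply(find_interval, axis=1))
    ... <see above> ...
)
result = pd.concat([df[~m].assign(Days=1), result]).sort_index()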
The .sort_index() part is there to make sure that the Process order is the same as in df. Remove it if that's not important.
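For reference, the mask-based variant can be written out in full as the sketch below. It repeats the sample data and find_interval so that it runs on its own. The variable name split, the step-by-step column assignments in place of the chained .assign, and the use of a stable sort are choices made here for illustration, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Start": pd.to_datetime(["2022-06-07 06:24:48", "2022-06-07 14:37:16", "2022-06-07 08:00:59"]),
    "End": pd.to_datetime(["2022-06-07 14:07:00", "2022-06-08 02:51:21", "2022-06-09 13:18:34"]),
    "Process": ["PROD", "VORG", "STO"],
    "Duration_Min": [462.20, 734.08, 3197.58],
})

def find_interval(row):
    start, end = row.at["Start"], row.at["End"]
    days = pd.date_range(start, end, freq="D", normalize=True).to_list()
    if len(days) == 1 or days[-1] != end:
        days.append(end)
    days[0] = start
    return list(zip(days, days[1:]))

# True for rows whose Start and End fall on different calendar days.
m = df["Start"].dt.normalize() < df["End"].dt.normalize()

# Split only the multi-day rows (assumed variable name: split).
split = df[m].copy()
split["Days"] = split.apply(find_interval, axis=1)
split = split.explode("Days")
split["Start"] = pd.to_datetime([pair[0] for pair in split["Days"]])
split["End"] = pd.to_datetime([pair[1] for pair in split["Days"]])
split["Duration_Min"] = (split["End"] - split["Start"]).dt.total_seconds().div(60)
split["Days"] = split.groupby("Process").cumcount() + 1

# A stable sort keeps the pieces of one event in chronological order.
result = pd.concat([df[~m].assign(Days=1), split]).sort_index(kind="mergesort")
```

The single-day rows keep their original Duration_Min, while the split rows get a recomputed one, so the two can differ by rounding in the input data.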
Alright, so the first thing you want to do is add a date column:
df["date"]=df[["Start","End"]].min(axis=1).dt.date
Once you have that, group by the relevant columns:
df = (
    df.groupby(["date", "Process"])
    .agg({"Start": "min", "End": "min", "Duration_Min": "sum", "Days": "any"})
    .reset_index()
)
And you should end up with the relevant dataframe.