简体   繁体   English

如何将 date_range 与日频率一起使用?

[英]How to use date_range with day frequency?

I tried to use date_range with a day frequency on this dataframe:我尝试在此date_range上使用日期范围和day频率:

df = pd.DataFrame({'Start':['2022-06-07 06:24:48','2022-06-07 14:37:16','2022-06-07 08:00:59'],
                   'End':['2022-06-07 14:07:00','2022-06-08 02:51:21','2022-06-09 13:18:34'],
                   'Process':['PROD','VORG','STO'], 
                   'Duration_Min':[462.20,734.08,3197.58]})

df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
df['difference']=df['End'].dt.date-df['Start'].dt.date

def find_interval(sr):
    dti = pd.date_range(sr['Start'], sr['End'], freq='1D').normalize() + pd.Timedelta(days=1)
    return list(zip([sr['Start']] + dti.tolist(), dti.tolist() + [sr['End']]))

df1 = df.apply(find_interval, axis=1).explode().apply(pd.Series)
df1 = df.drop(columns=['Start', 'End']).join(df1).rename(columns={0: 'Start', 1: 'End'})
df1['Duration_Min']=(df1["End"]-df1["Start"]).dt.total_seconds().div(60)

What I get is:我得到的是:

Process  Duration_Min difference               Start                 End
0    PROD   1055.200000     0 days 2022-06-07 06:24:48 2022-06-08 00:00:00
0    PROD   -593.000000     0 days 2022-06-08 00:00:00 2022-06-07 14:07:00
1    VORG    562.733333     1 days 2022-06-07 14:37:16 2022-06-08 00:00:00
1    VORG    171.350000     1 days 2022-06-08 00:00:00 2022-06-08 02:51:21
2     STO    959.016667     2 days 2022-06-07 08:00:59 2022-06-08 00:00:00
2     STO   1440.000000     2 days 2022-06-08 00:00:00 2022-06-09 00:00:00
2     STO   1440.000000     2 days 2022-06-09 00:00:00 2022-06-10 00:00:00
2     STO   -641.433333     2 days 2022-06-10 00:00:00 2022-06-09 13:18:34

I would like to cut the events so that new timestamps with new intervals are created when the day changes between Start and End .我想削减事件,以便在StartEnd之间的日期变化时创建具有新间隔的新时间戳。 If the difference between the dates is 0 days I don't need to create new timestamps and with the Timedelta(days=1) the End timestamp is mismatched.如果日期之间的差异是0 days ,我不需要创建新的时间戳,并且Timedelta(days=1)End时间戳不匹配。 The column Days should be corosponding with weekday() Days列应与weekday()对应

What I want:我想要的是:

Start                 End                   Process  Duration_Min  Days
0 2022-06-07 06:24:48 2022-06-07 14:07:00    PROD    462.200000     1
1 2022-06-07 14:37:16 2022-06-07 23:59:59    VORG    562.716667     1
2 2022-06-08 00:00:00 2022-06-08 02:51:21    VORG    171.350000     2
3 2022-06-07 08:00:59 2022-06-07 23:59:59     STO    959.000000     1
4 2022-06-08 00:00:00 2022-06-08 23:59:59     STO   1439.983333     2
5 2022-06-09 00:00:00 2022-06-09 13:18:34     STO    798.566667     3

How could I achieve this?我怎么能做到这一点?

You could try:你可以试试:

def find_interval(row):
    start, end = row.at["Start"], row.at["End"]
    days = pd.date_range(start, end, freq="D", normalize=True).to_list()
    if len(days) == 1 or days[-1] != end:
        days.append(end)
    days[0] = start
    return list(zip(days, days[1:]))

result = (
    df
    .assign(Days=df.apply(find_interval, axis=1))
    .explode("Days")
    .assign(
        Start=lambda df: df["Days"].str[0],
        End=lambda df: df["Days"].str[1],
        Duration_Min=lambda df:
            (df["End"] - df["Start"]).dt.total_seconds().div(60),
        Days=lambda df: df.groupby("Process").transform("cumcount") + 1
    )
)

Result for your df :您的df结果:

                 Start                  End Process  Duration_Min  Days
0  2022-06-07 06:24:48  2022-06-07 14:07:00    PROD    462.200000     1
1  2022-06-07 14:37:16  2022-06-08 00:00:00    VORG    562.733333     1
1  2022-06-08 00:00:00  2022-06-08 02:51:21    VORG    171.350000     2
2  2022-06-07 08:00:59  2022-06-08 00:00:00     STO    959.016667     1
2  2022-06-08 00:00:00  2022-06-09 00:00:00     STO   1440.000000     2
2  2022-06-09 00:00:00  2022-06-09 13:18:34     STO    798.566667     3

If, as indicated in the comments, the substantial part of df doesn't need the day-separation, then the following might be better:如果如评论中所示, df的大部分不需要日间分隔,那么以下可能会更好:

m = df["Start"].dt.date < df["End"].dt.date
result = (
    df[m]
    .assign(Days=df.apply(find_interval, axis=1))
    ... <see above> ...
)
result = pd.concat([df[~m].assign(Days=1), result]).sort_index()

The .sort_index() -part is to make sure that the Process -order is the same as in df . .sort_index()部分是为了确保Process -order 与df中的相同。 Remove it, if that's not important.删除它,如果这不重要的话。

Alright so the first thing you want to do is have a date column:好的,所以您要做的第一件事是有一个日期列:

df["date"]=df[["Start","End"]].min(axis=1).dt.date

Once you have that, you will now want to groupby according to your relevant columns一旦你有了它,你现在需要根据你的相关列进行分组

    df = df.groupby(["date",
"Process"]).agg({"Start":"min","End":"min","Duration_Min":"sum", "Days":"any"}).reset_index()

And you should end up with the relevant dataframe你最终应该得到相关的 dataframe

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM