Apply a function to dataframe rows, using each result as the input for the next row

Question

I am trying to create a rudimentary scheduling system. Here is what I have so far:

I have a pandas dataframe job_data that looks like this:

wc  job  start                duration
1   J1   2022-08-16 07:30:00  17
1   J2   2022-08-16 07:30:00  5
2   J3   2022-08-16 07:30:00  21
2   J4   2022-08-16 07:30:00  12

It contains a wc (work center), a job, a start date, and a duration for the job in hours.

I have created a function add_hours that takes the following arguments: start (datetime) and hours (int).

It calculates when the job will be complete based on the start time and duration.

The code for add_hours is:

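from datetime import datetime, timedelta

# business_hours (a dict with "weekdays", "from" and "to" keys) and holidays
# are defined elsewhere and are not shown here


def is_in_open_hours(dt):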
    return (
        dt.weekday() in business_hours["weekdays"]
        and dt.date() not in holidays
        and business_hours["from"].hour <= dt.time().hour < business_hours["to"].hour
    )


def get_next_open_datetime(dt):
    while True:
        dt = dt + timedelta(days=1)
        if dt.weekday() in business_hours["weekdays"] and dt.date() not in holidays:
            dt = datetime.combine(dt.date(), business_hours["from"])
            return dt


def add_hours(dt, hours):
    while hours != 0:
        if is_in_open_hours(dt):
            dt = dt + timedelta(hours=1)
            hours = hours - 1
        else:
            dt = get_next_open_datetime(dt)
    return dt
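
The functions above depend on two names the question does not show, business_hours and holidays. The sketch below fills them in with assumed values (Monday to Friday, 07:00 to 17:00, no holidays), chosen only because they reproduce the J1 end time reported in the question; the real configuration may differ.

```python
from datetime import datetime, time, timedelta

# Assumed configuration: these values are NOT given in the question.
business_hours = {
    "weekdays": [0, 1, 2, 3, 4],  # Monday=0 ... Friday=4
    "from": time(7, 0),
    "to": time(17, 0),
}
holidays = []  # assumed: a list of datetime.date objects


def is_in_open_hours(dt):
    return (
        dt.weekday() in business_hours["weekdays"]
        and dt.date() not in holidays
        and business_hours["from"].hour <= dt.time().hour < business_hours["to"].hour
    )


def get_next_open_datetime(dt):
    # Move forward one day at a time until an open day is found,
    # then return that day's opening time.
    while True:
        dt = dt + timedelta(days=1)
        if dt.weekday() in business_hours["weekdays"] and dt.date() not in holidays:
            return datetime.combine(dt.date(), business_hours["from"])


def add_hours(dt, hours):
    # Consume the duration one hour at a time, skipping closed periods.
    while hours != 0:
        if is_in_open_hours(dt):
            dt = dt + timedelta(hours=1)
            hours = hours - 1
        else:
            dt = get_next_open_datetime(dt)
    return dt


end = add_hours(datetime(2022, 8, 16, 7, 30), 17)
print(end)  # 2022-08-17 14:00:00
```

Under these assumed hours, ten of the 17 hours fit on 2022-08-16 and the remaining seven run from 07:00 to 14:00 the next day.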

The code to calculate the end column is:

df["end"] = df.apply(lambda x: add_hours(x.start, x.duration), axis=1)

The result of the function is the end column:

wc  job  start                duration  end
1   J1   2022-08-16 07:30:00  17        2022-08-17 14:00:00
1   J2   2022-08-16 07:30:00  5         2022-08-17 10:00:00
2   J3   2022-08-16 07:30:00  21        2022-08-18 08:00:00
2   J4   2022-08-16 07:30:00  12        2022-08-18 08:00:00

The problem is that I need the start datetime in the second row to be the end datetime from the previous row, instead of every row using the same start date. I also need to restart this process for each wc.

So the desired output would be:

wc  job  start                duration  end
1   J1   2022-08-16 07:30:00  17        2022-08-17 14:00:00
1   J2   2022-08-17 14:00:00  5         2022-08-17 19:00:00
2   J3   2022-08-16 07:30:00  21        2022-08-18 08:00:00
2   J4   2022-08-18 08:00:00  10        2022-08-18 18:00:00

Answer 1

You can use Timedelta and groupby operations.

As you did not provide your custom function, I'll apply a simple addition of the duration here:

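import pandas as pd

df['start'] = pd.to_datetime(df['start'])

# job durations as timedeltas, grouped by work center
t = pd.to_timedelta(df['duration'], unit='h')
g = t.groupby(df['wc'])

# offset each start by the total duration of the earlier jobs in the same wc
# (transform keeps the original index, so the addition aligns row by row)
df['start'] = df['start'].add(g.transform(lambda x: x.cumsum().shift(fill_value=pd.Timedelta(0))))

df['end'] = df['start'].add(t)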

Output:

   wc job               start  duration                 end
0   1  J1 2022-08-16 07:30:00        17 2022-08-17 00:30:00
1   1  J2 2022-08-17 00:30:00         5 2022-08-17 05:30:00
2   2  J3 2022-08-16 07:30:00        21 2022-08-17 04:30:00
3   2  J4 2022-08-17 04:30:00        12 2022-08-17 16:30:00
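
When the completion time cannot be expressed as a plain addition (for example when it comes from the question's add_hours), one option is an explicit loop per wc that feeds each computed end into the next row's start. The sketch below is a minimal version of that idea. chain_schedule and its finish_fn parameter are hypothetical names, and plain addition stands in for add_hours so that the block runs on its own.

```python
from datetime import timedelta

import pandas as pd


def chain_schedule(df, finish_fn):
    """Return a copy of df in which each job starts when the previous job
    in the same wc ends. finish_fn(start, hours) must return the end time.
    (Hypothetical helper, not part of pandas.)"""
    out = df.copy()
    out["start"] = pd.to_datetime(out["start"])
    starts, ends = {}, {}
    for _, group in out.groupby("wc", sort=False):
        current = group["start"].iloc[0]  # first start in this wc
        for idx, hours in group["duration"].items():
            starts[idx] = current
            current = finish_fn(current, int(hours))  # end of this job
            ends[idx] = current  # also the start of the next job
    out["start"] = pd.Series(starts)
    out["end"] = pd.Series(ends)
    return out


job_data = pd.DataFrame({
    "wc": [1, 1, 2, 2],
    "job": ["J1", "J2", "J3", "J4"],
    "start": ["2022-08-16 07:30:00"] * 4,
    "duration": [17, 5, 21, 12],
})

# Plain addition as a stand-in; pass add_hours instead to respect business hours.
result = chain_schedule(job_data, lambda start, hours: start + timedelta(hours=hours))
print(result)
```

With the stand-in function this reproduces the start and end columns shown above. The loop is slower than the vectorized version, but it works with any function that maps a start and a duration to an end.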

Answer 2

I show an alternative method where you only need the first start date and then bootstrap the lists according to the job durations.

# import required modules
import io
from datetime import datetime, timedelta

import pandas as pd

# make a dataframe
# note: only the first start date is required
raw = '''
wc,job,start,duration,end
1,J1,2022-08-16 07:30:00,17,2022-08-17 14:00:00
1,J2,2022-08-16 07:30:00,5,2022-08-17 10:00:00
2,J3,2022-08-16 07:30:00,21,2022-08-18 08:00:00
2,J4,2022-08-16 07:30:00,12,2022-08-18 08:00:00
'''
df = pd.read_csv(io.StringIO(raw))

# construct start and end lists
# note: rows are chained in order across the whole dataframe,
# without restarting for each wc
start = datetime.strptime(df['start'][0], '%Y-%m-%d %H:%M:%S')
start_list = [start]
end_list = []
for hours in df['duration']:
    new_time = start_list[-1] + timedelta(hours=float(hours))
    start_list.append(new_time)
    end_list.append(new_time)

# drop the trailing entry: it is the end of the last job, not a start
start_list.pop(-1)

# add to dataframe
df['start'] = start_list
df['end'] = end_list

# finished
df

The result is this: [output not preserved]

Answer 3

I'm not sure how large your dataset is, but if it's not too big you could use the following compact solution (it can take a while to run on larger data, because add_hours repeats the earlier jobs' hours for every row):

# running total of hours within each wc (a plain sum would give every row the group total)
df['cum_duration'] = df.groupby('wc').duration.cumsum()
df['end'] = df.apply(lambda x: add_hours(x.start, x.cum_duration), axis=1)

If the OP provides the business_hours df, I could try to validate this solution.
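
The difference between a per-group total and a per-group running total matters for this approach: the end of each job depends on the hours accumulated up to and including that job, not on the total for the whole wc. A small sketch on the question's duration column shows the two side by side (the dataframe is rebuilt here so the block runs on its own):

```python
import pandas as pd

df = pd.DataFrame({"wc": [1, 1, 2, 2], "duration": [17, 5, 21, 12]})

# Group total, broadcast to every row of the group
total = df.groupby("wc")["duration"].transform("sum")

# Running total within each group, in row order
running = df.groupby("wc")["duration"].cumsum()

print(total.tolist())    # [22, 22, 33, 33]
print(running.tolist())  # [17, 22, 21, 33]
```

Passing the running total to add_hours from the shared start gives each job's end, provided every row in a wc carries the same start time, as in the question's data.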
