Pandas dataframe create new rows based on condition from another column
I have a dataframe:
import pandas as pd

data = {'startTime': ['01-06-2010 09:00:00', '13-02-2016 09:00:00', '18-03-2018 09:00:00', '23-05-2011 09:00:00'],
        'endTime': ['02-06-2010 17:00:00', '14-02-2016 17:00:00', '19-03-2018 17:00:00', '24-05-2011 17:00:00'],
        'durationInMinutes': [1440, 1440, 1440, 1440]}
df = pd.DataFrame(data)
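I want to add rows by dividing the 1440 minutes into equal intervals of 8 hours per day. So 1440 minutes equals 3 days (3 more rows), each running from 9 am to 5 pm. The minutes can exceed 1440. The new rows in startTime and endTime would be: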
startTime endTime
01-06-2010 09:00:00 01-06-2010 17:00:00
02-06-2010 09:00:00 02-06-2010 17:00:00
03-06-2010 09:00:00 03-06-2010 17:00:00
13-02-2016 09:00:00 13-02-2016 17:00:00
14-02-2016 09:00:00 14-02-2016 17:00:00
15-02-2016 09:00:00 15-02-2016 17:00:00
Can anyone help me with this? Thanks.
IIUC, you can use:
import numpy as np

# ensure datetime
df[['startTime', 'endTime']] = df[['startTime', 'endTime']].apply(pd.to_datetime, dayfirst=True)
# compute number of rows in days
extra = np.ceil(df['durationInMinutes'].div(60*8)).astype(int)
# compute a shift (+0, +1, +2days etc.)
shift = extra.repeat(extra).groupby(level=0).cumcount().mul(pd.Timedelta('1day'))
# duplicate the rows
df2 = df.loc[df.index.repeat(extra)].reset_index(drop=True)
# add the shift
df2[['startTime', 'endTime']] = df2[['startTime', 'endTime']].add(shift.values, axis=0)
print(df2)
Output:
startTime endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-02 17:00:00 1440
1 2010-06-02 09:00:00 2010-06-03 17:00:00 1440
2 2010-06-03 09:00:00 2010-06-04 17:00:00 1440
3 2016-02-13 09:00:00 2016-02-14 17:00:00 1440
4 2016-02-14 09:00:00 2016-02-15 17:00:00 1440
5 2016-02-15 09:00:00 2016-02-16 17:00:00 1440
6 2018-03-18 09:00:00 2018-03-19 17:00:00 1440
7 2018-03-19 09:00:00 2018-03-20 17:00:00 1440
8 2018-03-20 09:00:00 2018-03-21 17:00:00 1440
9 2011-05-23 09:00:00 2011-05-24 17:00:00 1440
10 2011-05-24 09:00:00 2011-05-25 17:00:00 1440
11 2011-05-25 09:00:00 2011-05-26 17:00:00 1440
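The output above shifts both columns by whole days, so each endTime still falls on the day after its startTime, whereas the question asks for 09:00 to 17:00 on the same day. A minimal sketch of one possible adaptation follows. It assumes every generated row starts at the time of day given in startTime and lasts a full working day, as in the question's expected rows. The helper name `expand_to_workdays` and its `day_minutes` parameter are made up for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd

def expand_to_workdays(df, day_minutes=8 * 60):
    """Repeat each row once per (started) working day of `day_minutes` minutes."""
    out = df.copy()
    out['startTime'] = pd.to_datetime(out['startTime'], dayfirst=True)
    # rows needed per input row: ceil(duration / length of a working day)
    n_days = np.ceil(out['durationInMinutes'] / day_minutes).astype(int)
    # duplicate the rows; the original index labels are kept for grouping
    rep = out.loc[out.index.repeat(n_days)].copy()
    # 0, 1, 2, ... within each group of rows that share an original label
    day_no = rep.groupby(level=0).cumcount()
    rep['startTime'] = rep['startTime'] + pd.to_timedelta(day_no, unit='D')
    # assumption: every generated row spans one full working day
    rep['endTime'] = rep['startTime'] + pd.Timedelta(minutes=day_minutes)
    return rep.reset_index(drop=True)

df = pd.DataFrame({
    'startTime': ['01-06-2010 09:00:00', '13-02-2016 09:00:00'],
    'endTime': ['02-06-2010 17:00:00', '14-02-2016 17:00:00'],
    'durationInMinutes': [1440, 1440],
})
result = expand_to_workdays(df)
print(result)
```

Because the end time is derived from the shifted start time rather than from the original endTime column, the original endTime values are simply overwritten.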
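Update #2:
As several of the OP's comments clarified, here is one way to solve the problem such that:
- the `durationInMinutes` column values match the input;
- `durationInMinutes` is set to match the above logic;
- `startTime` is 09:00 on the corresponding date;
- `endTime` is a period of `durationInMinutes` later than `startTime`, to match the above logic.

df['days'] = df.durationInMinutes // (8 * 60) + (df.durationInMinutes % (8 * 60) > 0)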
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1) +
[x.durationInMinutes % (8 * 60) + (x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = pd.to_datetime(df.startTime, dayfirst=True)
# snap each start to 09:00 and shift by whole days; Timedelta arithmetic
# rolls over month ends, unlike adding to the day-of-month component
df.startTime = (df.startTime.dt.normalize() + pd.Timedelta(hours=9) +
    pd.to_timedelta(df.daysToAdd.astype(int), unit='D'))
df.endTime = df.startTime + pd.to_timedelta(df.durationInMinutes.astype(int), unit='min')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)
Input:
startTime endTime durationInMinutes
0 01-06-2010 09:00:00 02-06-2010 17:00:00 475
1 13-02-2016 08:30:00 14-02-2016 17:00:00 510
2 18-03-2018 09:30:00 19-03-2018 17:00:00 1440
3 23-05-2011 09:00:00 24-05-2011 17:00:00 1440
Output:
startTime endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-01 16:55:00 475
1 2016-02-13 09:00:00 2016-02-13 17:00:00 480
2 2016-02-14 09:00:00 2016-02-14 09:30:00 30
3 2018-03-18 09:00:00 2018-03-18 17:00:00 480
4 2018-03-19 09:00:00 2018-03-19 17:00:00 480
5 2018-03-20 09:00:00 2018-03-20 17:00:00 480
6 2011-05-23 09:00:00 2011-05-23 17:00:00 480
7 2011-05-24 09:00:00 2011-05-24 17:00:00 480
8 2011-05-25 09:00:00 2011-05-25 17:00:00 480
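The core of the logic above is splitting a total duration into full 8-hour days plus a remainder. As a plain-Python sketch of just that arithmetic, assuming positive durations (the function name `split_minutes` is hypothetical and used only for illustration):

```python
def split_minutes(total, day_minutes=8 * 60):
    """Split `total` minutes into chunks of at most `day_minutes` each."""
    full_days, remainder = divmod(total, day_minutes)
    chunks = [day_minutes] * full_days
    if remainder:
        # the last day only gets the minutes that are left over
        chunks.append(remainder)
    return chunks

print(split_minutes(475))   # [475]
print(split_minutes(510))   # [480, 30]
print(split_minutes(1440))  # [480, 480, 480]
```

The chunks always sum to the input value, which is the first requirement listed for this update.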
Update: Here is a way to do what you asked in the question and clarified in your comments:
df['days'] = (df.durationInMinutes // (8 * 60) +
    (df.durationInMinutes % (8 * 60) > 0).astype(int))
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1) +
[x.durationInMinutes % (8 * 60) +
(x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = (pd.to_datetime(df.startTime, dayfirst=True).astype('int64') +
df.daysToAdd * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df.endTime = (df.startTime.astype('int64') +
df.durationInMinutes * 60*1_000_000_000).astype('datetime64[ns]')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)
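Explanation:
- Calculate the number of `days`.
- Update the `durationInMinutes` column to hold, for each input row, a list with the minutes for each result row derived from it.
- Create the column `daysToAdd`, holding for each input row a list with the number of days to add to `startTime` for each result row derived from it.
- Use `explode()` to create result rows with one value each from the lists in `durationInMinutes` and `daysToAdd`.
- Add the nanosecond equivalent of `daysToAdd` to `startTime`.
- Update `endTime` to `startTime` plus the nanosecond equivalent of `durationInMinutes`.
- Use `drop()` to remove the columns that are no longer needed, and `reset_index()` to get an integer range index that starts at 0 and increases by 1 per row.

Input: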
startTime endTime durationInMinutes
0 01-06-2010 09:00:00 02-06-2010 17:00:00 1445
1 13-02-2016 09:00:00 14-02-2016 17:00:00 1435
2 18-03-2018 09:00:00 19-03-2018 17:00:00 1440
3 23-05-2011 09:00:00 24-05-2011 17:00:00 1440
Output:
startTime endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-01 17:00:00 480
1 2010-06-02 09:00:00 2010-06-02 17:00:00 480
2 2010-06-03 09:00:00 2010-06-03 17:00:00 480
3 2010-06-04 09:00:00 2010-06-04 09:05:00 5
4 2016-02-13 09:00:00 2016-02-13 17:00:00 480
5 2016-02-14 09:00:00 2016-02-14 17:00:00 480
6 2016-02-15 09:00:00 2016-02-15 16:55:00 475
7 2018-03-18 09:00:00 2018-03-18 17:00:00 480
8 2018-03-19 09:00:00 2018-03-19 17:00:00 480
9 2018-03-20 09:00:00 2018-03-20 17:00:00 480
10 2011-05-23 09:00:00 2011-05-23 17:00:00 480
11 2011-05-24 09:00:00 2011-05-24 17:00:00 480
12 2011-05-25 09:00:00 2011-05-25 17:00:00 480
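The nanosecond arithmetic used in the code above can also be expressed with pandas timedeltas, which may be easier to read. The sketch below compares the two formulations on made-up sample values. It is an alternative way to write the same step, not what the answer itself uses, and it fixes the datetime unit to nanoseconds explicitly so that the integer arithmetic is valid.

```python
import pandas as pd

# sample values for illustration only; the unit is forced to nanoseconds
start = pd.Series(
    pd.to_datetime(['01-06-2010 09:00:00', '28-02-2016 09:00:00'], dayfirst=True)
).astype('datetime64[ns]')
days_to_add = pd.Series([1, 2])

# formulation 1: integer nanoseconds since the epoch, as in the answer
ns_per_day = 24 * 60 * 60 * 1_000_000_000
via_ints = (start.astype('int64') + days_to_add * ns_per_day).astype('datetime64[ns]')

# formulation 2: timedelta arithmetic
via_timedelta = start + pd.to_timedelta(days_to_add, unit='D')

print((via_ints == via_timedelta).all())
print(via_timedelta)
```

Both variants roll over month boundaries correctly, since they add an elapsed time rather than changing the day-of-month field.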
Original answer:
Here is one way to solve your problem:
df = pd.concat([df.assign(durationInMinutes=df.durationInMinutes/3,
    orig_row=i).reset_index() for i in range(3)])
for col in ['startTime', 'endTime']:
    df[col] = (pd.to_datetime(df[col], dayfirst=True).astype('int64') +
        df.orig_row * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df = df.sort_values('index').drop(columns=['index', 'orig_row'])
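Explanation:
- Divide the `durationInMinutes` column by 3.
- Concatenate 3 copies of `df`, each with a new column `orig_row` containing an integer that corresponds to the copy's number (0, 1 or 2).
- For each of `startTime` and `endTime`, convert the string values to datetimes in nanoseconds and add the nanosecond equivalent of `orig_row` days (24 hours * 60 minutes * 60 seconds * 1bn nanoseconds).

Input: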
startTime endTime durationInMinutes
0 01-06-2010 09:00:00 02-06-2010 17:00:00 1440
1 13-02-2016 09:00:00 14-02-2016 17:00:00 1440
2 18-03-2018 09:00:00 19-03-2018 17:00:00 1440
3 23-05-2011 09:00:00 24-05-2011 17:00:00 1440
Output:
startTime endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-02 17:00:00 480.0
0 2010-06-02 09:00:00 2010-06-03 17:00:00 480.0
0 2010-06-03 09:00:00 2010-06-04 17:00:00 480.0
1 2016-02-13 09:00:00 2016-02-14 17:00:00 480.0
1 2016-02-14 09:00:00 2016-02-15 17:00:00 480.0
1 2016-02-15 09:00:00 2016-02-16 17:00:00 480.0
2 2018-03-18 09:00:00 2018-03-19 17:00:00 480.0
2 2018-03-19 09:00:00 2018-03-20 17:00:00 480.0
2 2018-03-20 09:00:00 2018-03-21 17:00:00 480.0
3 2011-05-23 09:00:00 2011-05-24 17:00:00 480.0
3 2011-05-24 09:00:00 2011-05-25 17:00:00 480.0
3 2011-05-25 09:00:00 2011-05-26 17:00:00 480.0
Troubleshooting:
In the comments, the OP mentioned getting `ValueError: columns must have matching element counts`. In my environment, a `print(df)` placed just before the `explode()` line in the UPDATED solution gives:
startTime endTime durationInMinutes days daysToAdd
0 01-06-2010 09:00:00 02-06-2010 17:00:00 [480, 480, 480, 5] 4 (0, 1, 2, 3)
1 13-02-2016 09:00:00 14-02-2016 17:00:00 [480, 480, 475] 3 (0, 1, 2)
2 18-03-2018 09:00:00 19-03-2018 17:00:00 [480, 480, 480] 3 (0, 1, 2)
3 23-05-2011 09:00:00 24-05-2011 17:00:00 [480, 480, 480] 3 (0, 1, 2)
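That ValueError is what `explode()` raises when, in some row, the list-like columns being exploded together have different lengths. A small diagnostic sketch follows, assuming pandas 1.3 or later (the version in which `explode()` started accepting a list of columns). The column names follow the answer, but the data here is made up.

```python
import pandas as pd

df = pd.DataFrame({
    'durationInMinutes': [[480, 480, 480, 5], [480, 480, 475]],
    'daysToAdd': [[0, 1, 2, 3], [0, 1, 2]],
})

# per-row element counts of the two list columns
len_minutes = df['durationInMinutes'].map(len)
len_days = df['daysToAdd'].map(len)

# rows where the counts differ would make a multi-column explode() fail
mismatched = df[len_minutes != len_days]
print(mismatched.empty)

exploded = df.explode(['durationInMinutes', 'daysToAdd'])
print(exploded)
```

If `mismatched` is not empty, the rows it lists are the ones to inspect before calling `explode()`.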