PYTHON - PANDAS - Groupby update row value
I have a pandas DataFrame that looks like this (I duplicated each row):
START END
0 2018-03-02 23:56:02 2018-03-03 01:25:50
1 2018-03-03 23:44:10 2018-03-04 03:03:05
2 2018-02-05 21:57:06 2018-02-06 08:25:19
3 2018-02-06 19:30:00 2018-02-07 09:04:13
4 2018-02-07 21:51:07 2018-02-08 08:13:34
0 2018-03-02 23:56:02 2018-03-03 01:25:50
1 2018-03-03 23:44:10 2018-03-04 03:03:05
2 2018-02-05 21:57:06 2018-02-06 08:25:19
3 2018-02-06 19:30:00 2018-02-07 09:04:13
4 2018-02-07 21:51:07 2018-02-08 08:13:34
I'd like to update the rows to look like this:
START END
0 2018-03-02 23:56:02 **2018-03-02 23:59:59**
1 2018-03-03 23:44:10 **2018-03-03 23:59:59**
2 2018-02-05 21:57:06 **2018-02-05 23:59:59**
3 2018-02-06 19:30:00 **2018-02-06 23:59:59**
4 2018-02-07 21:51:07 **2018-02-07 23:59:59**
0 **2018-03-03 00:00:00** 2018-03-03 01:25:50
1 **2018-03-04 00:00:00** 2018-03-04 03:03:05
2 **2018-02-06 00:00:00** 2018-02-06 08:25:19
3 **2018-02-07 00:00:00** 2018-02-07 09:04:13
4 **2018-02-08 00:00:00** 2018-02-08 08:13:34
I tried to use groupby with head or tail, but it doesn't work:
df.loc[df.groupby(df.index).head(1).index, 'END'] = df.START.replace(hour=23, minute=59, second=59)
df.loc[df.groupby(df.index).tail(1).index, 'START'] = df.END.replace(hour=0, minute=0, second=0)
I think I'm missing something. Thanks for your help.
print (df)
START END
0 2018-03-02 23:56:02 2018-03-03 01:25:50
1 2018-03-03 23:44:10 2018-03-04 03:03:05
2 2018-02-05 21:57:06 2018-02-06 08:25:19
3 2018-02-06 19:30:00 2018-02-07 09:04:13
4 2018-02-07 21:51:07 2018-02-08 08:13:34
First, use dt.floor to set the start and end dates:
df1, df2 = df.copy(), df.copy()
df1['END'] = df1.START.dt.floor('d') + pd.Timedelta(1, unit='d') - pd.Timedelta(1, unit='s')
df2['START'] = df2.END.dt.floor('d')
df = pd.concat([df1,df2], ignore_index=True)
print (df)
START END
0 2018-03-02 23:56:02 2018-03-02 23:59:59
1 2018-03-03 23:44:10 2018-03-03 23:59:59
2 2018-02-05 21:57:06 2018-02-05 23:59:59
3 2018-02-06 19:30:00 2018-02-06 23:59:59
4 2018-02-07 21:51:07 2018-02-07 23:59:59
5 2018-03-03 00:00:00 2018-03-03 01:25:50
6 2018-03-04 00:00:00 2018-03-04 03:03:05
7 2018-02-06 00:00:00 2018-02-06 08:25:19
8 2018-02-07 00:00:00 2018-02-07 09:04:13
9 2018-02-08 00:00:00 2018-02-08 08:13:34
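If the frame has already been duplicated, as in the question, the repeated index labels are likely why the `.loc` attempt fails: `df.loc[labels, ...]` matches both copies of every label, and `Series.replace` does not accept `hour`/`minute`/`second` keywords. The following is a minimal sketch, not the answer's own method, that tells the copies apart by their position within each label using `groupby(level=0).cumcount()`. The names `base`, `dup` and `copy_no` are illustrative, and only two of the sample rows are used.

```python
import pandas as pd

# Two sample rows taken from the question.
base = pd.DataFrame({
    "START": pd.to_datetime(["2018-03-02 23:56:02", "2018-03-03 23:44:10"]),
    "END": pd.to_datetime(["2018-03-03 01:25:50", "2018-03-04 03:03:05"]),
})
dup = pd.concat([base, base])  # index is 0, 1, 0, 1, as in the question

# Number the copies within each index label: 0 = first copy, 1 = second copy.
copy_no = dup.groupby(level=0).cumcount()
first = (copy_no == 0).to_numpy()
second = (copy_no == 1).to_numpy()

# First copy: END becomes the last second of START's day.
end_of_day = (dup.loc[first, "START"].dt.floor("D")
              + pd.Timedelta(days=1) - pd.Timedelta(seconds=1))
dup.loc[first, "END"] = end_of_day.to_numpy()  # to_numpy() avoids index alignment

# Second copy: START becomes midnight of END's day.
dup.loc[second, "START"] = dup.loc[second, "END"].dt.floor("D").to_numpy()

print(dup)
```

Boolean masks select rows by position, so the duplicated labels no longer matter. The original index is kept here; pass `ignore_index=True` to `pd.concat` if a fresh 0..n-1 index is preferred.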
Instead of floor, it is possible to use the slower apply + replace:
df1['END'] = df1.START.apply(lambda x: x.replace(hour=23, minute=59, second=59))
df2['START'] = df2.END.apply(lambda x: x.replace(hour=0, minute=0, second=0))
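One difference between the two approaches is worth keeping in mind: `Timestamp.replace` only changes the fields that are passed, so any sub-second part is left in place, whereas `floor` truncates everything below the day. The sample data has whole seconds, so both give the same result there. A small sketch with a made-up timestamp that carries microseconds:

```python
import pandas as pd

# Hypothetical timestamp with a fractional second (not from the question's data).
ts = pd.Timestamp("2018-03-02 23:56:02.123456")

# replace() leaves the microseconds untouched because they are not passed.
replaced = ts.replace(hour=0, minute=0, second=0)

# floor('D') drops everything below the day.
floored = ts.floor("D")

print(replaced)
print(floored)
```

If the data can contain fractional seconds, also passing `microsecond=0` (and `nanosecond=0` where relevant) to `replace` would be needed to match the `floor` result.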
Timings:
df = pd.concat([df] * 10000, ignore_index=True)
In [242]: %%timeit
...: df1, df2 = df.copy(), df.copy()
...: df1['END'] = df1.START.dt.floor('d') + pd.Timedelta(1, unit='d') - pd.Timedelta(1, unit='s')
...: df2['START'] = df2.END.dt.floor('d')
...:
100 loops, best of 3: 19.1 ms per loop
In [243]: %%timeit
...: df1, df2 = df.copy(), df.copy()
...: df1['END'] = df1.START.apply(lambda x: x.replace(hour=23, minute=59, second=59))
...: df2['START'] = df2.END.apply(lambda x: x.replace(hour=0, minute=0, second=0))
...:
1 loop, best of 3: 534 ms per loop
Trying to formulate what you want to do:
For each row that is duplicated,
* create 1 row with the begin time (and replace end time)
* create 1 row with the end time (and replace start time)
Maybe it helps to use the duplicated function?
df[df.duplicated(keep='last')]
should return the first half (keep='last' marks every occurrence except the last one), where you can then replace the end time; likewise, you can use
df[df.duplicated(keep='first')]
for the other half.
You can read more about the function here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
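A minimal sketch of how the `duplicated` idea could be carried through, assuming every row appears exactly twice and that the copies are identical in all columns (`duplicated` compares column values, not index labels). The names `base`, `first_half` and `second_half` are illustrative, and only two of the sample rows are used.

```python
import pandas as pd

# Two sample rows taken from the question, each duplicated once.
base = pd.DataFrame({
    "START": pd.to_datetime(["2018-03-02 23:56:02", "2018-03-03 23:44:10"]),
    "END": pd.to_datetime(["2018-03-03 01:25:50", "2018-03-04 03:03:05"]),
})
df = pd.concat([base, base])

# Compute both masks BEFORE modifying anything: once a row is changed,
# it is no longer a duplicate of its twin.
first_half = df.duplicated(keep="last").to_numpy()    # True for the first copies
second_half = df.duplicated(keep="first").to_numpy()  # True for the second copies

# Candidate values for every row, computed from the untouched columns.
new_end = (df["START"].dt.floor("D")
           + pd.Timedelta(days=1) - pd.Timedelta(seconds=1)).to_numpy()
new_start = df["END"].dt.floor("D").to_numpy()

df.loc[first_half, "END"] = new_end[first_half]
df.loc[second_half, "START"] = new_start[second_half]

print(df)
```

Rows that occur only once, or more than twice, would not be split correctly by these masks and would need separate handling.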