How to combine 2 rows based on different column values
I'm new to Python. I'm using pandas and I have the data below as a dataframe, with 3 fields: Task, Status_From and Status_To.
If the Status_To of one row is the same as the Status_From of the next row, then those 2 rows should be combined, per Task.
+------+-------------+-----------+
| Task | Status_From | Status_To |
+------+-------------+-----------+
| AAA | 31-Aug-18 | 04-Sep-18 |
| BBB | 21-Jun-18 | 21-Jun-18 |
| BBB | 21-Jun-18 | 29-Jun-18 |
| BBB | 29-Jun-18 | 29-Jun-18 |
| CCC | 20-Aug-18 | 20-Aug-18 |
| CCC | 24-Aug-18 | 24-Aug-18 |
| CCC | 24-Aug-18 | 01-Sep-18 |
| DDD | 06-Jul-18 | 06-Jul-18 |
| EEE | 18-May-18 | 18-May-18 |
| FFF | 01-Aug-18 | 01-Aug-18 |
| GGG | 20-Apr-18 | 23-Apr-18 |
| GGG | 23-Apr-18 | 23-Apr-18 |
| HHH | 22-Jan-18 | 23-Jan-18 |
| HHH | 23-Jan-18 | 23-Jan-18 |
| HHH | 23-Jan-18 | 30-Jan-18 |
+------+-------------+-----------+
Expected output:
+------+-------------+-----------+
| Task | Status_From | Status_To |
+------+-------------+-----------+
| AAA | 31-Aug-18 | 04-Sep-18 |
| BBB | 21-Jun-18 | 29-Jun-18 |
| CCC | 20-Aug-18 | 20-Aug-18 |
| CCC | 24-Aug-18 | 01-Sep-18 |
| DDD | 06-Jul-18 | 06-Jul-18 |
| EEE | 18-May-18 | 18-May-18 |
| FFF | 01-Aug-18 | 01-Aug-18 |
| GGG | 20-Apr-18 | 23-Apr-18 |
| HHH | 22-Jan-18 | 30-Jan-18 |
+------+-------------+-----------+
I tried a 'for' loop with an 'if' condition, but it didn't work. Is there a simple way to do this?
Assuming your data is already sorted, you can use cumsum() to set up groups, take the last Status_To of each group, and then drop_duplicates().
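# g numbers the chains within each Task: it increments whenever a row's
# Status_From differs from the previous row's Status_To
df1 = df.assign(
    g=df.groupby('Task').apply(lambda x: (x.Status_From != x.Status_To.shift()).cumsum()).reset_index(level=0, drop=True)
)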
Output of df1 is:
# Task Status_From Status_To g
#0 AAA 31-Aug-18 04-Sep-18 1
#1 BBB 21-Jun-18 21-Jun-18 1
#2 BBB 21-Jun-18 29-Jun-18 1
#3 BBB 29-Jun-18 29-Jun-18 1
#4 CCC 20-Aug-18 20-Aug-18 1
#5 CCC 24-Aug-18 24-Aug-18 2
#6 CCC 24-Aug-18 01-Sep-18 2
#7 DDD 06-Jul-18 06-Jul-18 1
#8 EEE 18-May-18 18-May-18 1
#9 FFF 01-Aug-18 01-Aug-18 1
#10 GGG 20-Apr-18 23-Apr-18 1
#11 GGG 23-Apr-18 23-Apr-18 1
#12 HHH 22-Jan-18 23-Jan-18 1
#13 HHH 23-Jan-18 23-Jan-18 1
#14 HHH 23-Jan-18 30-Jan-18 1
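To see where the values of g come from, the comparison and the cumsum() can be followed for a single task. This is only an illustrative sketch using the three CCC rows from the question; the names ccc and starts_new_chain are made up for the example.

```python
import pandas as pd

# The three CCC rows from the question (illustrative subset)
ccc = pd.DataFrame({
    "Status_From": ["20-Aug-18", "24-Aug-18", "24-Aug-18"],
    "Status_To": ["20-Aug-18", "24-Aug-18", "01-Sep-18"],
})

# shift() moves Status_To down one row, so each row is compared with the
# previous row's Status_To; the first row is compared with NaN, which never matches
starts_new_chain = ccc["Status_From"] != ccc["Status_To"].shift()

# The running count of "chain starts" gives every chain its own label
g = starts_new_chain.cumsum()

print(starts_new_chain.tolist())  # [True, True, False]
print(g.tolist())                 # [1, 2, 2]
```

The second row does not continue the first (24-Aug-18 vs 20-Aug-18), so it opens group 2, and the third row continues the second, so it stays in group 2.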
Then, use transform to copy the last Status_To of each group onto all of its rows, and keep only the first row of each group:
df1['Status_To'] = df1.groupby(['Task', 'g']).Status_To.transform('last')
df1 = df1.drop_duplicates(['Task','g']).drop('g', axis=1)
The new output will be:
# Task Status_From Status_To
#0 AAA 31-Aug-18 04-Sep-18
#1 BBB 21-Jun-18 29-Jun-18
#4 CCC 20-Aug-18 20-Aug-18
#5 CCC 24-Aug-18 01-Sep-18
#7 DDD 06-Jul-18 06-Jul-18
#8 EEE 18-May-18 18-May-18
#9 FFF 01-Aug-18 01-Aug-18
#10 GGG 20-Apr-18 23-Apr-18
#12 HHH 22-Jan-18 30-Jan-18
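For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the question's data. Note that it is a variant of the approach, not the exact code above: it computes the group labels with a grouped shift() instead of groupby.apply, and collapses each group with agg (first Status_From, last Status_To) instead of transform plus drop_duplicates. The variable names (rows, prev_to, new_chain, result) are just for illustration, and it still assumes the rows are already sorted.

```python
import pandas as pd

rows = [
    ("AAA", "31-Aug-18", "04-Sep-18"),
    ("BBB", "21-Jun-18", "21-Jun-18"),
    ("BBB", "21-Jun-18", "29-Jun-18"),
    ("BBB", "29-Jun-18", "29-Jun-18"),
    ("CCC", "20-Aug-18", "20-Aug-18"),
    ("CCC", "24-Aug-18", "24-Aug-18"),
    ("CCC", "24-Aug-18", "01-Sep-18"),
    ("DDD", "06-Jul-18", "06-Jul-18"),
    ("EEE", "18-May-18", "18-May-18"),
    ("FFF", "01-Aug-18", "01-Aug-18"),
    ("GGG", "20-Apr-18", "23-Apr-18"),
    ("GGG", "23-Apr-18", "23-Apr-18"),
    ("HHH", "22-Jan-18", "23-Jan-18"),
    ("HHH", "23-Jan-18", "23-Jan-18"),
    ("HHH", "23-Jan-18", "30-Jan-18"),
]
df = pd.DataFrame(rows, columns=["Task", "Status_From", "Status_To"])

# Previous row's Status_To within the same Task (NaN on the first row of a Task)
prev_to = df.groupby("Task")["Status_To"].shift()

# A new chain starts wherever Status_From does not continue the previous Status_To
new_chain = df["Status_From"] != prev_to

# Number the chains within each Task (cast to int so cumsum yields plain counts)
g = new_chain.astype(int).groupby(df["Task"]).cumsum().rename("g")

# Collapse each (Task, g) chain to its first Status_From and last Status_To
result = (
    df.groupby(["Task", g], sort=False)
      .agg(Status_From=("Status_From", "first"), Status_To=("Status_To", "last"))
      .reset_index()
      .drop(columns="g")
)
print(result)
```

Unlike drop_duplicates, this version renumbers the index from 0, so the index labels differ from the output shown above while the rows are the same.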