
How to combine 2 rows based on different column values

I'm new to Python. I'm using pandas and have the data below as a DataFrame with three fields: Task, Status_From and Status_To.

If the Status_To of one row is the same as the Status_From of the next row, and both rows belong to the same Task, those two rows should be combined into one.

+------+-------------+-----------+
| Task | Status_From | Status_To |
+------+-------------+-----------+
| AAA  | 31-Aug-18   | 04-Sep-18 |
| BBB  | 21-Jun-18   | 21-Jun-18 |
| BBB  | 21-Jun-18   | 29-Jun-18 |
| BBB  | 29-Jun-18   | 29-Jun-18 |
| CCC  | 20-Aug-18   | 20-Aug-18 |
| CCC  | 24-Aug-18   | 24-Aug-18 |
| CCC  | 24-Aug-18   | 01-Sep-18 |
| DDD  | 06-Jul-18   | 06-Jul-18 |
| EEE  | 18-May-18   | 18-May-18 |
| FFF  | 01-Aug-18   | 01-Aug-18 |
| GGG  | 20-Apr-18   | 23-Apr-18 |
| GGG  | 23-Apr-18   | 23-Apr-18 |
| HHH  | 22-Jan-18   | 23-Jan-18 |
| HHH  | 23-Jan-18   | 23-Jan-18 |
| HHH  | 23-Jan-18   | 30-Jan-18 |
+------+-------------+-----------+
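To make the question reproducible, the table above can be loaded into a DataFrame as in the sketch below. The variable names `rows` and `df` are only illustrative, and the dates are kept as plain strings exactly as they appear in the table.

```python
import pandas as pd

# Rows copied from the table in the question; dates stay as strings.
rows = [
    ("AAA", "31-Aug-18", "04-Sep-18"),
    ("BBB", "21-Jun-18", "21-Jun-18"),
    ("BBB", "21-Jun-18", "29-Jun-18"),
    ("BBB", "29-Jun-18", "29-Jun-18"),
    ("CCC", "20-Aug-18", "20-Aug-18"),
    ("CCC", "24-Aug-18", "24-Aug-18"),
    ("CCC", "24-Aug-18", "01-Sep-18"),
    ("DDD", "06-Jul-18", "06-Jul-18"),
    ("EEE", "18-May-18", "18-May-18"),
    ("FFF", "01-Aug-18", "01-Aug-18"),
    ("GGG", "20-Apr-18", "23-Apr-18"),
    ("GGG", "23-Apr-18", "23-Apr-18"),
    ("HHH", "22-Jan-18", "23-Jan-18"),
    ("HHH", "23-Jan-18", "23-Jan-18"),
    ("HHH", "23-Jan-18", "30-Jan-18"),
]
df = pd.DataFrame(rows, columns=["Task", "Status_From", "Status_To"])
print(df)
```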

Expected output:

+------+-------------+-----------+
| Task | Status_From | Status_To |
+------+-------------+-----------+
| AAA  | 31-Aug-18   | 04-Sep-18 |
| BBB  | 21-Jun-18   | 29-Jun-18 |
| CCC  | 20-Aug-18   | 20-Aug-18 |
| CCC  | 24-Aug-18   | 01-Sep-18 |
| DDD  | 06-Jul-18   | 06-Jul-18 |
| EEE  | 18-May-18   | 18-May-18 |
| FFF  | 01-Aug-18   | 01-Aug-18 |
| GGG  | 20-Apr-18   | 23-Apr-18 |
| HHH  | 22-Jan-18   | 30-Jan-18 |
+------+-------------+-----------+

I tried a 'for' loop with an 'if' condition, but it didn't work. Is there a simple way to do this?

Assuming your data is already sorted, you can use cumsum() to set up groups, find the last Status_To of each group and then call drop_duplicates().

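# Within each Task, a row starts a new group when its Status_From differs from
# the previous row's Status_To; cumsum() turns those starts into a group number g.
df1 = df.assign(
    g=df.groupby('Task').apply(lambda x: (x.Status_From != x.Status_To.shift()).cumsum()).reset_index(level=0, drop=True)
)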

Output of df1 is:

#   Task Status_From  Status_To  g
#0   AAA   31-Aug-18  04-Sep-18  1
#1   BBB   21-Jun-18  21-Jun-18  1
#2   BBB   21-Jun-18  29-Jun-18  1
#3   BBB   29-Jun-18  29-Jun-18  1
#4   CCC   20-Aug-18  20-Aug-18  1
#5   CCC   24-Aug-18  24-Aug-18  2
#6   CCC   24-Aug-18  01-Sep-18  2
#7   DDD   06-Jul-18  06-Jul-18  1
#8   EEE   18-May-18  18-May-18  1
#9   FFF   01-Aug-18  01-Aug-18  1
#10  GGG   20-Apr-18  23-Apr-18  1
#11  GGG   23-Apr-18  23-Apr-18  1
#12  HHH   22-Jan-18  23-Jan-18  1
#13  HHH   23-Jan-18  23-Jan-18  1
#14  HHH   23-Jan-18  30-Jan-18  1
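As an alternative sketch, the same group number can be computed without `groupby.apply`, using a per-group `shift()` followed by a per-group `cumsum()`. This is not the code from the answer; it is a variant that may sidestep version-dependent details of how `groupby.apply` builds its result index. The names `prev_to` and `starts_chain` are illustrative, and only the BBB and CCC rows are used to keep the example short.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Task": ["BBB", "BBB", "BBB", "CCC", "CCC", "CCC"],
        "Status_From": ["21-Jun-18", "21-Jun-18", "29-Jun-18",
                        "20-Aug-18", "24-Aug-18", "24-Aug-18"],
        "Status_To": ["21-Jun-18", "29-Jun-18", "29-Jun-18",
                      "20-Aug-18", "24-Aug-18", "01-Sep-18"],
    }
)

# Previous row's Status_To within the same Task (missing on each Task's first row).
prev_to = df.groupby("Task")["Status_To"].shift()

# A row starts a new chain when it does not continue the previous row.
starts_chain = df["Status_From"] != prev_to

# Running count of chain starts per Task gives the group number g.
df1 = df.assign(g=starts_chain.astype(int).groupby(df["Task"]).cumsum())
print(df1)
```

Because the result keeps the original index throughout, no `reset_index` step is needed before assigning `g`.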

Then, use transform:

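# Give every row in a (Task, g) group the last Status_To of that group.
df1['Status_To'] = df1.groupby(['Task', 'g']).Status_To.transform('last')
# Keep only the first row of each group, then drop the helper column.
df1 = df1.drop_duplicates(['Task','g']).drop('g', axis=1)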

New output will be:

#   Task Status_From  Status_To
#0   AAA   31-Aug-18  04-Sep-18
#1   BBB   21-Jun-18  29-Jun-18
#4   CCC   20-Aug-18  20-Aug-18
#5   CCC   24-Aug-18  01-Sep-18
#7   DDD   06-Jul-18  06-Jul-18
#8   EEE   18-May-18  18-May-18
#9   FFF   01-Aug-18  01-Aug-18
#10  GGG   20-Apr-18  23-Apr-18
#12  HHH   22-Jan-18  30-Jan-18
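The two steps can also be wrapped into one helper. The sketch below is a variant rather than the answer's exact code: it takes the first Status_From and last Status_To of each group with `agg` instead of `transform` plus `drop_duplicates`, and, like the answer, it assumes the rows are already sorted within each Task. The function name `merge_chained_rows` is hypothetical.

```python
import pandas as pd


def merge_chained_rows(df):
    """Collapse consecutive rows of a Task whose Status_From equals the
    previous row's Status_To. Assumes rows are already sorted per Task."""
    prev_to = df.groupby("Task")["Status_To"].shift()
    # Group number: increases whenever a row does not continue the previous one.
    g = (df["Status_From"] != prev_to).astype(int).groupby(df["Task"]).cumsum()
    return (
        df.groupby(["Task", g.rename("g")], sort=False)
        .agg(Status_From=("Status_From", "first"), Status_To=("Status_To", "last"))
        .reset_index()
        .drop(columns="g")
    )


# A few rows from the question's table.
sample = pd.DataFrame(
    [
        ("AAA", "31-Aug-18", "04-Sep-18"),
        ("BBB", "21-Jun-18", "21-Jun-18"),
        ("BBB", "21-Jun-18", "29-Jun-18"),
        ("BBB", "29-Jun-18", "29-Jun-18"),
        ("CCC", "20-Aug-18", "20-Aug-18"),
        ("CCC", "24-Aug-18", "24-Aug-18"),
        ("CCC", "24-Aug-18", "01-Sep-18"),
    ],
    columns=["Task", "Status_From", "Status_To"],
)
result = merge_chained_rows(sample)
print(result)
```

Unlike the output above, this version returns a fresh 0-based index instead of keeping the original row labels.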
