Looping through groupby in a Python dataframe

Question

I am new to Python. I am trying to write code that loops through the data in a pandas dataframe. Below is my initial data:

A   B   C   Start Date  End Date
1   2   5   01/01/15    1/31/15
1   2   4   02/01/15    2/28/15
1   2   7   02/25/15    3/15/15
1   2   9   03/11/15    3/30/15
1   2   8   03/14/15    4/5/15
1   2   3   03/31/15    4/10/15
1   2   4   04/05/15    4/27/15
1   2   11  04/15/15    4/20/15
4   5   23  5/6/16      6/6/16
4   5   12  6/10/16     7/10/16

I want to create a new column, Forward_C. For each row, Forward_C is the value of C from the first row that satisfies these conditions:

Columns A and B are equal to those of the current row.
The Start Date of that row is greater than both the Start Date and the End Date of the current row.

The expected output is:

A   B   C   Start Date  End Date    Forward_C
1   2   5   01/01/15    1/31/15        4
1   2   4   02/01/15    2/28/15        9
1   2   7   02/25/15    3/15/15        3
1   2   9   03/11/15    3/30/15        3
1   2   8   03/14/15    4/5/15         11
1   2   3   03/31/15    4/10/15        11
1   2   4   04/05/15    4/27/15         0
1   2   11  04/15/15    4/20/15         0
4   5   23  5/6/16      6/6/16         12
4   5   12  6/10/16     7/10/16         0

I wrote the code below to achieve this:

# Sort the rows within each (A, B) group by start and end date.
df = data.groupby(['A', 'B'], as_index=False).apply(
    lambda x: x.sort_values(['Start Date', 'End Date'], ascending=True))

# For every row, scan all rows for the first matching row in the same group.
for i, j in df.iterrows():
    for index, row in df.iterrows():
        if (j['A'] == row['A'] and j['B'] == row['B']
                and row['Start Date'] > j['End Date']
                and row['Start Date'] > j['Start Date']):
            df.loc[i, 'Forward_C'] = row['C']
            break

Is there a more efficient way to do this in Python? Right now my code iterates through all the rows for each record, which will be slow because it has to handle more than 10 million records.

Your input is appreciated. Thanks in advance.

Regards, RD

Answer 1

The question was not entirely clear to me. Based on my understanding, this is what I could come up with. I am using a cross join instead of a loop.

import pandas

# `data` is the actual DataFrame from the question.
data['Join'] = "CrossJoinColumn"  # constant key, so the merge pairs every row with every row
df1 = pandas.merge(data, data, how="left", on="Join", suffixes=["", "_2"])
# Keep pairs in the same (A, B) group where the second row starts after the first row's start and end.
mask = ((df1['A'] == df1['A_2']) & (df1['B'] == df1['B_2'])
        & (df1['Start Date'] < df1['Start Date_2'])
        & (df1['End Date'] < df1['Start Date_2']))
keys = ['A', 'B', 'C', 'Start Date', 'End Date']
# Take the first match per original row; C_2 holds the forward value.
df1 = df1[mask].groupby(by=keys).first().reset_index()[keys + ['C_2']]
df1 = pandas.merge(data, df1, how="left", on=keys)
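
A cross join builds every pair of rows, so with 10 million records it may not fit in memory. As an alternative, here is a rough sketch that sorts each (A, B) group by Start Date and uses a binary search to find the first later row. This is not the method from the answer above. The function name add_forward_c is made up for illustration, and the sketch assumes the dates are month/day/two-digit-year strings as in the sample, and that rows with no match should get 0 as in the expected output.

```python
import numpy as np
import pandas as pd


def add_forward_c(data):
    """Return a sorted copy of `data` with a Forward_C column (hypothetical helper)."""
    df = data.copy()
    # Assumption: dates look like the question's sample, e.g. "1/31/15".
    df['Start Date'] = pd.to_datetime(df['Start Date'], format='%m/%d/%y')
    df['End Date'] = pd.to_datetime(df['End Date'], format='%m/%d/%y')
    df = df.sort_values(['A', 'B', 'Start Date', 'End Date']).reset_index(drop=True)
    df['Forward_C'] = 0  # assumption: 0 means "no later row qualifies"

    for _, g in df.groupby(['A', 'B'], sort=False):
        starts = g['Start Date'].to_numpy()
        ends = g['End Date'].to_numpy()
        c = g['C'].to_numpy()
        # A candidate row must start after both dates of the current row.
        cutoff = np.maximum(starts, ends)
        # starts is sorted, so this is the position of the first start > cutoff.
        pos = np.searchsorted(starts, cutoff, side='right')
        found = pos < len(g)
        out = np.zeros(len(g), dtype=c.dtype)
        out[found] = c[pos[found]]
        df.loc[g.index, 'Forward_C'] = out
    return df


# The sample data from the question.
data = pd.DataFrame({
    'A': [1, 1, 1, 1, 1, 1, 1, 1, 4, 4],
    'B': [2, 2, 2, 2, 2, 2, 2, 2, 5, 5],
    'C': [5, 4, 7, 9, 8, 3, 4, 11, 23, 12],
    'Start Date': ['01/01/15', '02/01/15', '02/25/15', '03/11/15', '03/14/15',
                   '03/31/15', '04/05/15', '04/15/15', '5/6/16', '6/10/16'],
    'End Date': ['1/31/15', '2/28/15', '3/15/15', '3/30/15', '4/5/15',
                 '4/10/15', '4/27/15', '4/20/15', '6/6/16', '7/10/16'],
})
result = add_forward_c(data)
print(result)
```

Each group costs one sort plus one binary search per row, instead of comparing every row with every other row. If several rows share the same Start Date, the one that sorts first by End Date is picked, which may or may not match what you need.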
