Looping through groupby in Python dataframe

I am new to Python. I am trying to write code that loops through the data in a pandas DataFrame. Below is my initial data:

A   B   C   Start Date  End Date
1   2   5   01/01/15    1/31/15
1   2   4   02/01/15    2/28/15
1   2   7   02/25/15    3/15/15
1   2   9   03/11/15    3/30/15
1   2   8   03/14/15    4/5/15
1   2   3   03/31/15    4/10/15
1   2   4   04/05/15    4/27/15
1   2   11  04/15/15    4/20/15
4   5   23  5/6/16      6/6/16
4   5   12  6/10/16     7/10/16

I want to create a new column called Forward_C. Forward_C is the value of C from the first later row that satisfies these conditions:

  1. Columns A and B should be equal to those of the current row.
  2. The Start Date of that row should be greater than both the Start Date and the End Date of the current row.
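
One way to read the two conditions is as a test on a pair of rows. The sketch below is only an illustration: the helper name `qualifies` and the three example rows (built from the sample data, with the dates already parsed) are assumptions, not part of the original code.

```python
import pandas as pd


def qualifies(current, candidate):
    """Return True if `candidate` may supply Forward_C for `current`."""
    # Condition 1: same A and same B.
    same_group = (current["A"] == candidate["A"]) and (current["B"] == candidate["B"])
    # Condition 2: the candidate starts after both dates of the current row.
    starts_later = (candidate["Start Date"] > current["Start Date"]) and (
        candidate["Start Date"] > current["End Date"]
    )
    return bool(same_group and starts_later)


# Hypothetical rows taken from the sample table above.
first = {"A": 1, "B": 2, "Start Date": pd.Timestamp("2015-01-01"),
         "End Date": pd.Timestamp("2015-01-31")}
second = {"A": 1, "B": 2, "Start Date": pd.Timestamp("2015-02-01"),
          "End Date": pd.Timestamp("2015-02-28")}
other_group = {"A": 4, "B": 5, "Start Date": pd.Timestamp("2016-05-06"),
               "End Date": pd.Timestamp("2016-06-06")}
```

Here `qualifies(first, second)` holds, while `qualifies(second, first)` and `qualifies(first, other_group)` do not. Forward_C for a row is then the C of the earliest row that qualifies.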

The expected output is:

A   B   C   Start Date  End Date    Forward_C
1   2   5   01/01/15    1/31/15     4
1   2   4   02/01/15    2/28/15     9
1   2   7   02/25/15    3/15/15     3
1   2   9   03/11/15    3/30/15     3
1   2   8   03/14/15    4/5/15      11
1   2   3   03/31/15    4/10/15     11
1   2   4   04/05/15    4/27/15     0
1   2   11  04/15/15    4/20/15     0
4   5   23  5/6/16      6/6/16      12
4   5   12  6/10/16     7/10/16     0

I wrote the code below to achieve this:

# 'Start Date' and 'End Date' must be datetimes (not strings) for the comparisons below.
df = data.sort_values(['A', 'B', 'Start Date', 'End Date']).reset_index(drop=True)
df['Forward_C'] = 0  # default when no later row qualifies

for i, j in df.iterrows():
    for index, row in df.iterrows():
        if (j['A'] == row['A']) and (j['B'] == row['B']) and (row['Start Date'] > j['End Date']) and (j['Start Date'] < row['Start Date']):
            # Write through df.loc; assigning to j would only change a copy of the row.
            df.loc[i, 'Forward_C'] = row['C']
            break

I was wondering if there is a more efficient way to do this in Python. Right now my code iterates through all the rows for each record. That will be slow, since it has to deal with more than 10 million records.

Your input is appreciated. Thanks in advance.

Regards, RD

I am not entirely clear on the question. Based on my understanding, this is what I could come up with. I am using a cross join instead of a loop.

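import pandas

# `data` is the actual DataFrame, with 'Start Date' and 'End Date' as datetimes and rows sorted by 'Start Date'.
data['Join'] = "CrossJoinColumn"  # constant key, so the merge below pairs every row with every row
df1 = pandas.merge(data, data, how="left", on="Join", suffixes=["", "_2"])
keys = ['A', 'B', 'C', 'Start Date', 'End Date']
# Keep pairs from the same (A, B) group where the second row starts after the first row's start and end.
later = (df1['A'] == df1['A_2']) & (df1['B'] == df1['B_2']) & (df1['Start Date'] < df1['Start Date_2']) & (df1['End Date'] < df1['Start Date_2'])
df1 = df1[later].groupby(by=keys).first().reset_index()[keys + ['C_2']]
# C_2 is the Forward_C value; it is NaN for rows with no qualifying later row.
df1 = pandas.merge(data, df1, how="left", on=keys)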
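
The cross join pairs every row with every other row, so its intermediate frame grows quadratically and is unlikely to fit in memory at 10 million records. A possible alternative, offered only as a sketch: sort each (A, B) group by start date and use `numpy.searchsorted` to find, for each row, the first row that starts strictly after it. The helper name `add_forward_c`, the `date_format` parameter and the `%m/%d/%y` date format are assumptions based on the sample data, not something the question specifies.

```python
import numpy as np
import pandas as pd


def add_forward_c(df, date_format="%m/%d/%y"):
    """Return a copy of df with Forward_C filled in (0 where nothing qualifies)."""
    out = df.copy()
    for col in ("Start Date", "End Date"):
        out[col] = pd.to_datetime(out[col], format=date_format)
    # Sort so that, inside each (A, B) group, start dates are ascending.
    out = out.sort_values(["A", "B", "Start Date", "End Date"])

    starts = out["Start Date"].to_numpy()
    # A qualifying row must start after both dates of the current row.
    threshold = np.maximum(starts, out["End Date"].to_numpy())
    c = out["C"].to_numpy()
    forward = np.zeros(len(out), dtype=c.dtype)

    # .indices maps each (A, B) key to the integer positions of its rows.
    for pos in out.groupby(["A", "B"]).indices.values():
        pos = np.sort(pos)
        # First position in the group whose start is strictly later.
        nxt = np.searchsorted(starts[pos], threshold[pos], side="right")
        # Append a 0 so "no later row" (nxt == len(pos)) maps to 0.
        forward[pos] = np.append(c[pos], 0)[nxt]

    out["Forward_C"] = forward
    return out.sort_index()  # restore the original row order


# The sample data from the question.
sample = pd.DataFrame({
    "A": [1, 1, 1, 1, 1, 1, 1, 1, 4, 4],
    "B": [2, 2, 2, 2, 2, 2, 2, 2, 5, 5],
    "C": [5, 4, 7, 9, 8, 3, 4, 11, 23, 12],
    "Start Date": ["01/01/15", "02/01/15", "02/25/15", "03/11/15", "03/14/15",
                   "03/31/15", "04/05/15", "04/15/15", "5/6/16", "6/10/16"],
    "End Date": ["1/31/15", "2/28/15", "3/15/15", "3/30/15", "4/5/15",
                 "4/10/15", "4/27/15", "4/20/15", "6/6/16", "7/10/16"],
})
result = add_forward_c(sample)
print(result["Forward_C"].tolist())  # [4, 9, 3, 3, 11, 11, 0, 0, 12, 0]
```

The work per group is one sort plus one binary search per row, so the cost is roughly O(n log n) rather than the O(n²) of the nested loops or the cross join. Note that the returned frame holds the two date columns as datetimes, not as the original strings.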
