Pandas: Join two data frames and apply a filter

I have 2 dataframes.

Dataframe 1

Userid | SessionID | Endtime
John   | '' | 0910
Paul   | '' | 0920
.....

Dataframe 2

UserID| SessionID | starttime|end time
John | 0 | 0905 | 0915
Jack | 1 | 0900 | 0915
....

Dataframe 1 has 333975 rows. Dataframe 2 has 2460 rows.

I want to label dataframe 2 with reference to dataframe 1. The match is: if the user in dataframe 1 equals the user in dataframe 2, and "Endtime" falls between "starttime" and "end time", copy the SessionID from dataframe 1 to dataframe 2.

My code goes like this:

for i in range(len(df1)):
    for j in range(len(df2)):
        if df1['Userid'][i] == df2['UserID'][j]:
            if (df1['Endtime'][i] > df2['starttime'][j]) and (df1['Endtime'][i] < df2['end time'][j]):
                df1.loc[i, 'SessionID'] = df2['SessionID'][j]

Previously, when I processed 65k rows of df1, it took 30 minutes to complete. Now with 333k rows it takes hours.

Is there a more efficient way to do this kind of labelling?

Update: I have also tried using np.where to do this, but it is also taking a long time. It has been running for 2 hours and is still going.

Here's my code:

df1.loc[i, 'SessionID'] = np.where((df1['Userid'][i] == df2['UserID'][j]) & (df1['Endtime'][i] > df2['starttime'][j]) & (df1['Endtime'][i] < df2['end time'][j]), df2['SessionID'][j], df1['SessionID'][i])

You can merge the two data frames and then apply a filter to the result.

import pandas as pd

raw_data = {
    'user_id': ['John', 'Paul'],
    'session_id': [1, 2],
    'end_time' : [910, 920]
}
pd_a = pd.DataFrame(
    raw_data, columns=['user_id', 'session_id', 'end_time']
)

raw_data = {
    'user_id': ['John', 'Paul'],
    'session_id': [1, 2],
    'start_time': [900, 900],
    'end_time' : [915, 925]
}
pd_b = pd.DataFrame(
    raw_data, columns=['user_id', 'session_id', 'start_time', 'end_time']
)

final_pd = pd.merge(pd_a, pd_b, on='user_id')

Output

  user_id  session_id_x  end_time_x  session_id_y  start_time  end_time_y
0    John             1         910             1         900         915
1    Paul             2         920             2         900         925

Then, finally, apply any filter you want.

final_pd[final_pd['end_time_x']<=final_pd['end_time_y']]
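Applied to the question's own column names, the merge-then-filter idea might look like the sketch below. The sample rows are made up for illustration, the helper column `row` and the `_df2` suffix are hypothetical names, and the code copies the session label from df2 into df1, which is what the question's loop does. If a row falls inside several sessions, `keep="last"` simply picks one of them.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df1 = pd.DataFrame({
    "Userid": ["John", "Paul", "John"],
    "SessionID": ["", "", ""],
    "Endtime": [910, 920, 930],
})
df2 = pd.DataFrame({
    "UserID": ["John", "Jack", "John"],
    "SessionID": [0, 1, 2],
    "starttime": [905, 900, 925],
    "end time": [915, 915, 935],
})

# Keep df1's row label in a column so matches can be written back later.
left = df1.reset_index().rename(columns={"index": "row"})

# Pair every df1 row with every df2 session of the same user.
merged = left.merge(df2, left_on="Userid", right_on="UserID", suffixes=("", "_df2"))

# Keep only pairs where Endtime lies strictly inside the session window.
in_window = (merged["Endtime"] > merged["starttime"]) & (merged["Endtime"] < merged["end time"])
matches = merged.loc[in_window].drop_duplicates("row", keep="last")

# Write the matched session labels back to the original rows.
df1.loc[matches["row"].to_numpy(), "SessionID"] = matches["SessionID_df2"].to_numpy()

print(df1)
```

The merged frame has one row per (df1 row, same-user session) pair, so its size depends on how many sessions each user has; with 2460 sessions in total it should stay manageable, but that is worth checking on the real data.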

You can try handling the operands of the second 'if' statement as Pandas Series or lists, and then, where the condition is satisfied, perform the labelling on the dataset.
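One way to read this suggestion is to loop only over the smaller frame and evaluate the condition on whole Series at once. The following is a minimal sketch under that assumption, with made-up sample rows; it replaces the inner per-row comparison with a boolean mask over df1.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df1 = pd.DataFrame({
    "Userid": ["John", "Paul", "John"],
    "SessionID": ["", "", ""],
    "Endtime": [910, 920, 930],
})
df2 = pd.DataFrame({
    "UserID": ["John", "Jack", "John"],
    "SessionID": [0, 1, 2],
    "starttime": [905, 900, 925],
    "end time": [915, 915, 935],
})

# One pass per session (2460 in the question) instead of one per row pair.
for user, session_id, start, end in zip(
    df2["UserID"], df2["SessionID"], df2["starttime"], df2["end time"]
):
    # Boolean Series: True for df1 rows of this user inside the window.
    mask = (df1["Userid"] == user) & (df1["Endtime"] > start) & (df1["Endtime"] < end)
    df1.loc[mask, "SessionID"] = session_id

print(df1)
```

Each iteration still scans all of df1, so this does 2460 vectorized passes rather than roughly 333975 × 2460 Python-level comparisons.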
