Python 3.x - Pandas apply is very slow

Question

I have created a recommender system. There are 2 dataframes – input_df and recommended_df

input_df – Dataframe of content already viewed by users. This df is used for generating the recommendations

User_Name   Viewed_Content_Name
User1   Content1
User1   Content2
User1   Content5
User2   Content1
User2   Content3
User2   Content5
User2   Content6
User2   Content8

recommended_df – Dataframe of content recommended to users

User_Name   Recommended_Content_Name
User1   Content1 # This recommendation has already been viewed by User1. Hence this recommendation should be removed
User1   Content8
User2   Content2
User2   Content7

I want to remove recommendations that the user has already viewed. I have tried the following two approaches, but both are very time-consuming. I need a faster way to identify rows of recommended_df whose user and content pair also occurs in input_df.

Approach 1 - Using subsetting: for each row in recommended_df, I check whether that row has already occurred in input_df

for i in range(len(recommended_df)):
    # Count the rows of input_df that match this recommendation's user and content
    same_user = input_df['User_Name'] == recommended_df.loc[i, 'User_Name']
    same_content = input_df['Viewed_Content_Name'] == recommended_df.loc[i, 'Recommended_Content_Name']
    recommended_df.loc[i, 'Recommendation_Completed'] = len(input_df[same_user & same_content])

# Remove the row if it already occurred in input_df
recommended_df = recommended_df.loc[recommended_df['Recommendation_Completed'] == 0]

Approach 2 - Try to see if the row in recommended_df occurs in input_df using apply

I created a key column in input_df and recommended_df. This is a unique key for each user and content pair

input_df =

User_Name   Viewed_Content_Name    keycol (User_Name + Viewed_Content_Name)
User1   Content1    User1Content1   
User1   Content2    User1Content2
User1   Content5    User1Content5
User2   Content1    User2Content1
User2   Content3    User2Content3
User2   Content5    User2Content5
User2   Content6    User2Content6
User2   Content8    User2Content8

recommended_df =

User_Name   Recommended_Content_Name    keycol (User_Name + Recommended_Content_Name)
User1   Content1    User1Content1
User1   Content8    User1Content8
User2   Content2    User2Content2
User2   Content7    User2Content7

recommended_df['Recommendation_Completed'] = recommended_df['keycol'].apply(lambda d: d in input_df['keycol'].values)

# Remove the row if it occurs in input_df
recommended_df = recommended_df.loc[~recommended_df['Recommendation_Completed']]

The second approach using apply is faster than approach 1, but I can still do the same thing faster in Excel using the COUNTIFS function. How can I do this faster in Python?

Answer 1

Use apply only as a last resort. You can concatenate user and content and then use boolean selection.

# Keys for the content each user has already seen
user_content_seen = input_df.User_Name + input_df.Viewed_Content_Name

# Keys for every recommendation
user_all = recommended_df.User_Name + recommended_df.Recommended_Content_Name

# Keep only the recommendations whose key is not among the seen keys
recommended_df[~user_all.isin(user_content_seen)]
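For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frames from the question and applies the concatenate-then-isin idea. The "|" separator is an assumption added here, not part of the answer: without a separator, different user and content pairs could in principle concatenate to the same string. The second variant is also an addition rather than the answer's method: it uses a left merge with indicator=True as an anti-join, which compares the two columns directly and needs no key column at all.

```python
import pandas as pd

# Sample data copied from the question.
input_df = pd.DataFrame({
    "User_Name": ["User1"] * 3 + ["User2"] * 5,
    "Viewed_Content_Name": ["Content1", "Content2", "Content5",
                            "Content1", "Content3", "Content5",
                            "Content6", "Content8"],
})
recommended_df = pd.DataFrame({
    "User_Name": ["User1", "User1", "User2", "User2"],
    "Recommended_Content_Name": ["Content1", "Content8", "Content2", "Content7"],
})

# Variant 1: concatenated key + isin (vectorised, no apply).
# SEP is an assumption: pick any string that cannot occur in the names.
SEP = "|"
seen_keys = input_df["User_Name"] + SEP + input_df["Viewed_Content_Name"]
rec_keys = (recommended_df["User_Name"] + SEP
            + recommended_df["Recommended_Content_Name"])
unseen_isin = recommended_df[~rec_keys.isin(seen_keys)]

# Variant 2: anti-join via a left merge with indicator=True.
# Rename the viewed column so both frames share the same key columns,
# and drop duplicate views so the merge cannot multiply rows.
seen_pairs = (
    input_df
    .rename(columns={"Viewed_Content_Name": "Recommended_Content_Name"})
    .drop_duplicates()
)
merged = recommended_df.merge(
    seen_pairs,
    on=["User_Name", "Recommended_Content_Name"],
    how="left",
    indicator=True,
)
# "left_only" marks recommendations with no matching viewed row.
unseen_merge = merged.loc[merged["_merge"] == "left_only",
                          recommended_df.columns]

print(unseen_isin)
print(unseen_merge)
```

Both variants replace the per-row Python loop with a single hashed lookup or join, which is why they should scale much better than apply on large frames.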

solution1
2 ACCEPTED 2016-12-20 13:24:12
