Most efficient way to compare two pandas DataFrames and update one of them based on a condition

Question

I have two DataFrames, df1 and df2. df2 consists of the columns "tagname" and "value". The dictionary "bucket_dict" holds the data from df2:

bucket_dict = dict(zip(df2.tagname,df2.value))

df1 has millions of rows and three columns: "apptag", "comments" and "Type". I want to match the two DataFrames as follows: if a key from bucket_dict is contained in df1["apptag"], then set df1["comments"] to that key and df1["Type"] to the corresponding value, bucket_dict[key]. I used the code below:

for each_tag in bucket_dict:
    # Rows whose apptag matches the current tag (case-insensitive; NaN counts as no match)
    mask = df1["apptag"].str.match(each_tag, case=False, na=False)
    df1.loc[mask, "comments"] = each_tag
    df1.loc[mask, "Type"] = bucket_dict[each_tag]

Is there a more efficient way to do this? The loop takes a long time.
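
One detail in the loop above is worth checking against the stated goal ("contains"): pandas' str.match only matches at the beginning of each string, while str.contains searches anywhere in it. The following is a minimal sketch of the difference; the sample values are hypothetical and only mirror the apptag format shown in this question.

```python
import pandas as pd

# Hypothetical sample data mirroring the apptag format in the question.
apptags = pd.Series(["test123-pen", "pen-test", None])

# str.match anchors the pattern at the beginning of each string (like re.match).
starts_with_pen = apptags.str.match("pen", case=False, na=False)

# str.contains finds the pattern anywhere in the string (like re.search).
contains_pen = apptags.str.contains("pen", case=False, na=False)

print(starts_with_pen.tolist())  # [False, True, False]
print(contains_pen.tolist())     # [True, True, False]
```

If the tag can appear anywhere in apptag, as in "test123-pen", str.contains is probably the call that expresses the intent.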

The bucketing DataFrame from which the dictionary was created:

import pandas as pd

bucketing_df = pd.DataFrame([["pen", "study"], ["pencil", "study"], ["ersr", "study"], ["rice", "grocery"], ["wht", "grocery"]], columns=['tagname', 'value'])

The other DataFrame:

output_df = pd.DataFrame([["test123-pen", "pen", " "], ["test234-pencil", "pencil", " "], ["test234-rice", "rice", " "]], columns=['apptag', 'comments', 'type'])

Required output:

Answer 1

You can do this by applying a function to the apptag column that looks up the matching row in bucketing_df with loc:

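def find_type(a):
    # Return the 'value' of the first bucketing_df row whose tagname occurs in a.
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['value'].values[0]
    except (IndexError, TypeError):
        # IndexError: no tagname matched; TypeError: a is not a string (e.g. NaN).
        return ""

def find_comments(a):
    # Return the first tagname in bucketing_df that occurs in a.
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['tagname'].values[0]
    except (IndexError, TypeError):
        return ""


output_df['type'] = output_df['apptag'].apply(find_type)
output_df['comments'] = output_df['apptag'].apply(find_comments)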

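I made these separate functions, each with its own try/except, so that rows where no tagname occurs in apptag get an empty string. Note that values[0] takes the first matching row in bucketing_df order, so when one tagname is a substring of another (such as "pen" and "pencil"), the more specific tagname has to come first to win.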

This gives the following output_df:

           apptag comments     type
0     test123-pen      pen    study
1  test234-pencil   pencil    study
2    test234-rice     rice  grocery

This code uses only the bucketing_df and output_df you provided at the end of your question.
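
Since apply calls a Python function once per row, it may still be slow on millions of rows. As an alternative (not part of the approach above), here is a minimal sketch that builds one regular expression from all tagnames, extracts the matching tag with str.extract, and maps it through the dictionary. It assumes the tagnames are lowercase and tries longer tagnames first so that "pencil" wins over "pen"; the extra "no-match" row is hypothetical and only shows the unmatched case.

```python
import re
import pandas as pd

bucketing_df = pd.DataFrame(
    [["pen", "study"], ["pencil", "study"], ["ersr", "study"],
     ["rice", "grocery"], ["wht", "grocery"]],
    columns=["tagname", "value"],
)
# Sample data from the question plus one hypothetical row without any tag.
output_df = pd.DataFrame(
    [["test123-pen", " ", " "], ["test234-pencil", " ", " "],
     ["test234-rice", " ", " "], ["no-match", " ", " "]],
    columns=["apptag", "comments", "type"],
)

bucket_dict = dict(zip(bucketing_df["tagname"], bucketing_df["value"]))

# Longest tagnames first, so that "pencil" is tried before "pen".
tags = sorted(bucket_dict, key=len, reverse=True)
pattern = "(" + "|".join(re.escape(tag) for tag in tags) + ")"

# One vectorized pass: extract the first tag found in each apptag (NaN if none).
matched = output_df["apptag"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Assumption: dictionary keys are lowercase, so normalize the matched text.
matched = matched.str.lower()

output_df["comments"] = matched.fillna("")
output_df["type"] = matched.map(bucket_dict).fillna("")

print(output_df)
```

The regular expression returns the leftmost match in each string, so an apptag containing several different tags is assigned the one that appears first.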

1 answer

solution1
0 ACCEPTED 2020-02-03 09:58:48
