
Apply a user-defined function to a groupby in pandas

My dataframe df looks something like this:

review_id   user_id  prod_id    review
0               10      5       this restaurant is the best.
1               30      10      Worst food.
2               10      15      Best place!
3               30      5       the food is too expensive.
4               30      10      Yummy! I love it.

I have defined a function, ACS, that I want to use to calculate the average content similarity of each user's reviews. I wrote the function as follows:

import numpy as np

def ACS(rvw1, rvw2):
    # Strip basic punctuation and lowercase both reviews
    rvw1 = rvw1.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower()
    rvw2 = rvw2.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower()
    rvw1words = rvw1.split()
    rvw2words = rvw2.split()
    # Shared vocabulary of both reviews
    allwords = list(set(rvw1words) | set(rvw2words))
    rvw1freq = []
    rvw2freq = []
    # Count how often each vocabulary word occurs in each review
    for word in allwords:
        rvw1freq.append(rvw1words.count(word))
        rvw2freq.append(rvw2words.count(word))
    # Cosine similarity of the two word-count vectors
    return np.dot(rvw1freq, rvw2freq) / (np.linalg.norm(rvw1freq) * np.linalg.norm(rvw2freq))

This function takes two strings as input and returns the similarity between them on a scale of 0 to 1. My aim is to calculate the content similarity for each user, so I formed a groupby as follows:
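To make the 0 to 1 scale concrete, here is a minimal sketch of the same word-count cosine similarity written with collections.Counter. The name word_count_cosine is hypothetical and is not part of the code above; it is only assumed to mirror what ACS computes for non-empty reviews.

```python
import math
from collections import Counter


def word_count_cosine(text1, text2):
    """Cosine similarity of word counts (hypothetical stand-in for ACS)."""
    def counts(text):
        # Same cleanup as ACS: drop , . ? ! and lowercase
        for ch in ",.?!":
            text = text.replace(ch, "")
        return Counter(text.lower().split())

    c1, c2 = counts(text1), counts(text2)
    # A Counter returns 0 for missing words, so only shared words contribute
    dot = sum(c1[word] * c2[word] for word in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)


# Same words after cleanup: similarity is (up to rounding) 1
same = word_count_cosine("Best place!", "best place")
# One shared word ("best") out of 5 and 2 words: 1 / sqrt(5 * 2)
partial = word_count_cosine("this restaurant is the best.", "Best place!")
# No shared words: similarity is 0
disjoint = word_count_cosine("Worst food.", "Yummy! I love it.")
```

Note that, like ACS, this sketch divides by zero if either review contains no words after cleanup.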

grouped = df.groupby('user_id')['review']

Now I want to apply my ACS function to each group (something like grouped.ACS()). The problem is that ACS takes two strings as input and calculates their similarity, but each group in the groupby may contain more than two review strings. How can I apply this function to each group so that it takes all the reviews in the group and calculates their content similarity? Many thanks.

You can use pd.merge to get the Cartesian product of the rows within each group and then use pd.DataFrame.apply to apply your function to every pair:

import pandas as pd

# Helper function to get combinations of a dataframe and their cosine similarity
def groupSimilarity(df):
    combinations = (df.assign(dummy=1)
                     .merge(df.assign(dummy=1), on="dummy")
                     .drop("dummy", axis=1))
    similarity = combinations.apply(lambda x: ACS(x["review_x"], x["review_y"]), axis=1)
    combinations.loc[:, "similarity"] = similarity
    return combinations

# apply function to each group
grouped = (df.groupby("user_id")
            .apply(groupSimilarity)
            .reset_index())

# >>> grouped[["review_id_x", "review_id_y", "user_id_x", "user_id_y", "similarity"]]
#     review_id_x  review_id_y  user_id_x  user_id_y  similarity
# 0             0            0         10         10    1.000000
# 1             0            2         10         10    0.316228
# 2             2            0         10         10    0.316228
# 3             2            2         10         10    1.000000
# 4             1            1         30         30    1.000000
# 5             1            3         30         30    0.316228
# 6             1            4         30         30    0.000000
# 7             3            1         30         30    0.316228
# 8             3            3         30         30    1.000000
# 9             3            4         30         30    0.000000
# 10            4            1         30         30    0.000000
# 11            4            3         30         30    0.000000
# 12            4            4         30         30    1.000000
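
The table above contains every ordered pair, including each review paired with itself (similarity 1) and both orderings of each pair. If what you want is a single average per user, one possible variation is to compare only unordered pairs of distinct reviews with itertools.combinations and return the mean per group. This is a sketch under assumptions: mean_pairwise_similarity is a hypothetical helper name, ACS is re-declared here only to keep the example self-contained, and returning NaN for a user with a single review is a design choice, not something the question specifies.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def ACS(rvw1, rvw2):
    """Cosine similarity of word counts, same logic as ACS in the question."""
    for ch in ",.?!":
        rvw1 = rvw1.replace(ch, "")
        rvw2 = rvw2.replace(ch, "")
    rvw1words = rvw1.lower().split()
    rvw2words = rvw2.lower().split()
    allwords = sorted(set(rvw1words) | set(rvw2words))
    rvw1freq = [rvw1words.count(w) for w in allwords]
    rvw2freq = [rvw2words.count(w) for w in allwords]
    return np.dot(rvw1freq, rvw2freq) / (np.linalg.norm(rvw1freq) * np.linalg.norm(rvw2freq))


def mean_pairwise_similarity(reviews):
    """Average ACS over all unordered pairs of distinct reviews in one group."""
    pairs = list(combinations(reviews, 2))
    if not pairs:
        # Assumption: a user with a single review has no pair to compare
        return np.nan
    return float(np.mean([ACS(a, b) for a, b in pairs]))


# Sample data copied from the question
df = pd.DataFrame({
    "review_id": [0, 1, 2, 3, 4],
    "user_id": [10, 30, 10, 30, 30],
    "prod_id": [5, 10, 15, 5, 10],
    "review": [
        "this restaurant is the best.",
        "Worst food.",
        "Best place!",
        "the food is too expensive.",
        "Yummy! I love it.",
    ],
})

# One average similarity value per user_id
avg_similarity = df.groupby("user_id")["review"].apply(mean_pairwise_similarity)
print(avg_similarity)
```

With the sample data, user 10 has one pair and user 30 has three, so the averages are taken over 1 and 3 values respectively. Skipping self-pairs matters here: including them would pull every user's average toward 1.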
