Apply a user-defined function to a groupby in pandas

My dataframe df looks something like this:

review_id  user_id  prod_id  review
0          10       5        this restaurant is the best.
1          30       10       Worst food.
2          10       15       Best place!
3          30       5        the food is too expensive.
4          30       10       Yummy! I love it.

I have defined a function ACS that I want to use to calculate the average content similarity of each user. I wrote the function as follows:

import numpy as np

def ACS(rvw1, rvw2):
    # Strip basic punctuation and lowercase both reviews
    rvw1 = rvw1.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower()
    rvw2 = rvw2.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower()
    rvw1words = rvw1.split()
    rvw2words = rvw2.split()
    # Vocabulary shared by the two reviews
    allwords = list(set(rvw1words) | set(rvw2words))
    rvw1freq = []
    rvw2freq = []
    for word in allwords:
        rvw1freq.append(rvw1words.count(word))
        rvw2freq.append(rvw2words.count(word))
    # Cosine similarity of the two word-count vectors
    return np.dot(rvw1freq, rvw2freq) / (np.linalg.norm(rvw1freq) * np.linalg.norm(rvw2freq))

This function takes two strings as input and returns the similarity between them on a scale of 0 to 1. My aim is to calculate the content similarity of each user, so I formed a groupby as follows:

grouped = df.groupby('user_id')['review']

Now I want to apply my ACS function to each group (something like grouped.ACS()). The problem is that ACS takes two strings as input and calculates their similarity, but each group in the groupby may have more than two review strings. How can I apply this function to each group so that it takes all the reviews in a group and calculates their content similarity? Many thanks.

You can use pd.merge to get the Cartesian product of rows and then use pd.DataFrame.apply to apply your function to each pair:

import pandas as pd

# Helper function to get combinations of a dataframe and their cosine similarity
def groupSimilarity(df):
    combinations = (df.assign(dummy=1)
                     .merge(df.assign(dummy=1), on="dummy")
                     .drop("dummy", axis=1))
    similarity = combinations.apply(lambda x: ACS(x["review_x"], x["review_y"]), axis=1)
    combinations.loc[:, "similarity"] = similarity
    return combinations

# Apply the helper to each group (pass the function itself, groupSimilarity)
grouped = (df.groupby("user_id")
            .apply(groupSimilarity)
            .reset_index())

# >>> grouped[["review_id_x", "review_id_y", "user_id_x", "user_id_y", "similarity"]]
#     review_id_x  review_id_y  user_id_x  user_id_y  similarity
# 0             0            0         10         10  1.000000
# 1             0            2         10         10  0.316228
# 2             2            0         10         10  0.316228
# 3             2            2         10         10  1.000000
# 4             1            1         30         30  1.000000
# 5             1            3         30         30  0.316228
# 6             1            4         30         30  0.000000
# 7             3            1         30         30  0.316228
# 8             3            3         30         30  1.000000
# 9             3            4         30         30  0.000000
# 10            4            1         30         30  0.000000
# 11            4            3         30         30  0.000000
# 12            4            4         30         30  1.000000
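
The table above still holds one row per ordered pair, including each review paired with itself. If what you want in the end is a single average per user, one possible sketch is to average ACS over the unordered pairs in each group. This is only an illustration: mean_pairwise_acs is a hypothetical helper name, ACS is restated compactly so the snippet runs on its own, and returning NaN for a user with a single review is an assumption about how that case should be handled.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def ACS(rvw1, rvw2):
    """Cosine similarity of word counts (same logic as the question's ACS)."""
    def words(text):
        for ch in ",.?!":
            text = text.replace(ch, "")
        return text.lower().split()

    w1, w2 = words(rvw1), words(rvw2)
    vocab = sorted(set(w1) | set(w2))
    v1 = [w1.count(w) for w in vocab]
    v2 = [w2.count(w) for w in vocab]
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))


def mean_pairwise_acs(reviews):
    """Hypothetical helper: mean ACS over all unordered pairs in one group."""
    pairs = list(combinations(reviews, 2))
    if not pairs:
        # Assumption: a user with a single review has no pair to compare
        return float("nan")
    return float(np.mean([ACS(a, b) for a, b in pairs]))


# Sample data from the question
df = pd.DataFrame({
    "review_id": [0, 1, 2, 3, 4],
    "user_id": [10, 30, 10, 30, 30],
    "prod_id": [5, 10, 15, 5, 10],
    "review": [
        "this restaurant is the best.",
        "Worst food.",
        "Best place!",
        "the food is too expensive.",
        "Yummy! I love it.",
    ],
})

# One average similarity value per user_id
avg_similarity = df.groupby("user_id")["review"].apply(mean_pairwise_acs)
print(avg_similarity)
```

For the sample data this gives roughly 0.316 for user 10 (one pair) and roughly 0.105 for user 30 (three pairs, two of which share no words). Using itertools.combinations skips the self-pairs and the mirrored duplicates that the Cartesian product produces, so the average is not inflated by the 1.0 rows.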
