Apply a user-defined function to a groupby in pandas
My dataframe df looks something like this:
review_id  user_id  prod_id  review
0          10       5        this restaurant is the best.
1          30       10       Worst food.
2          10       15       Best place!
3          30       5        the food is too expensive.
4          30       10       Yummy! I love it.
I have now defined a function ACS that I want to use to calculate the average content similarity of each user. I wrote the function as follows:
import numpy as np

def ACS(rvw1, rvw2):
    # Strip punctuation and lowercase both reviews
    rvw1 = rvw1.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower()
    rvw2 = rvw2.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower()
    rvw1words = rvw1.split()
    rvw2words = rvw2.split()
    # Shared vocabulary of the two reviews
    allwords = list(set(rvw1words) | set(rvw2words))
    rvw1freq = []
    rvw2freq = []
    for word in allwords:
        rvw1freq.append(rvw1words.count(word))
        rvw2freq.append(rvw2words.count(word))
    # Cosine similarity of the two word-count vectors
    return np.dot(rvw1freq, rvw2freq) / (np.linalg.norm(rvw1freq) * np.linalg.norm(rvw2freq))
This function takes two strings as input and returns the similarity between them on a scale of 0 to 1. My aim is to calculate the content similarity of each user, so I formed a groupby as follows:
grouped = df.groupby('user_id')['review']
Now I want to apply my ACS function to each group (something like grouped.ACS()). The problem is that ACS takes two strings as input and calculates their similarity, but each group in the groupby may have more than 2 review strings. What should I do to apply this function to each group so that it takes all the reviews from a group and calculates their content similarity? Many thanks.
You can use pd.merge to get the cartesian product of the rows in each group and then use pd.DataFrame.apply to apply your function to every pair:
import pandas as pd

# Helper function to get all pairs of rows in a dataframe and their cosine similarity
def groupSimilarity(df):
    # Merging on a constant dummy key yields the cartesian product of the rows
    combinations = (df.assign(dummy=1)
                      .merge(df.assign(dummy=1), on="dummy")
                      .drop("dummy", axis=1))
    similarity = combinations.apply(lambda x: ACS(x["review_x"], x["review_y"]), axis=1)
    combinations.loc[:, "similarity"] = similarity
    return combinations

# Apply the helper to each group
grouped = (df.groupby("user_id")
             .apply(groupSimilarity)
             .reset_index())
# >>> grouped[["review_id_x", "review_id_y", "user_id_x", "user_id_y", "similarity"]]
#     review_id_x  review_id_y  user_id_x  user_id_y  similarity
# 0             0            0         10         10    1.000000
# 1             0            2         10         10    0.316228
# 2             2            0         10         10    0.316228
# 3             2            2         10         10    1.000000
# 4             1            1         30         30    1.000000
# 5             1            3         30         30    0.316228
# 6             1            4         30         30    0.000000
# 7             3            1         30         30    0.316228
# 8             3            3         30         30    1.000000
# 9             3            4         30         30    0.000000
# 10            4            1         30         30    0.000000
# 11            4            3         30         30    0.000000
# 12            4            4         30         30    1.000000
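The pairwise table above lists every pair twice and also pairs each review with itself (always 1.0), so averaging it directly would overstate the similarity. If the goal is a single average per user, one possible approach is to score only the unordered pairs of distinct reviews. The following is a minimal sketch under that assumption, not the original answer's method: the helper name average_similarity is hypothetical, ACS is a compact restatement of the function from the question, and returning NaN for users with a single review is a choice made here for illustration. It also assumes no review is empty after punctuation is stripped, since that would make the norm zero.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def ACS(rvw1, rvw2):
    """Cosine similarity of word counts (same logic as the question's ACS)."""
    def words(text):
        for ch in ",.?!":
            text = text.replace(ch, "")
        return text.lower().split()

    w1, w2 = words(rvw1), words(rvw2)
    vocab = sorted(set(w1) | set(w2))
    f1 = [w1.count(w) for w in vocab]
    f2 = [w2.count(w) for w in vocab]
    return np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))


def average_similarity(reviews):
    """Hypothetical helper: mean ACS over unordered pairs of distinct reviews."""
    pairs = list(combinations(reviews, 2))
    if not pairs:
        # Assumption: a user with one review has no pair to compare
        return np.nan
    return np.mean([ACS(a, b) for a, b in pairs])


# Sample data copied from the question
df = pd.DataFrame({
    "review_id": [0, 1, 2, 3, 4],
    "user_id": [10, 30, 10, 30, 30],
    "prod_id": [5, 10, 15, 5, 10],
    "review": [
        "this restaurant is the best.",
        "Worst food.",
        "Best place!",
        "the food is too expensive.",
        "Yummy! I love it.",
    ],
})

# One value per user, indexed by user_id
avg = df.groupby("user_id")["review"].apply(average_similarity)
print(avg)
```

On the sample data this gives about 0.316228 for user 10 (one pair) and about 0.105409 for user 30 (the mean of 0.316228, 0 and 0). Selecting the review column before calling apply passes each group to the helper as a Series of strings, so the helper never needs to know about the other columns.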