My dataframe df looks something like this:

review_id  user_id  prod_id  review
0          10       5        this restaurant is the best.
1          30       10       Worst food.
2          10       15       Best place!
3          30       5        the food is too expensive.
4          30       10       Yummy! I love it.

I have defined a function ACS that I want to use to calculate the average content similarity of each user. I wrote the function as follows:
import numpy as np

def ACS(rvw1, rvw2):
    # Strip basic punctuation and lowercase both reviews
    rvw1 = rvw1.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower()
    rvw2 = rvw2.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower()
    rvw1words = rvw1.split()
    rvw2words = rvw2.split()
    # Shared vocabulary of the two reviews
    allwords = list(set(rvw1words) | set(rvw2words))
    rvw1freq = []
    rvw2freq = []
    for word in allwords:
        rvw1freq.append(rvw1words.count(word))
        rvw2freq.append(rvw2words.count(word))
    # Cosine similarity of the two word-count vectors
    return np.dot(rvw1freq, rvw2freq) / (np.linalg.norm(rvw1freq) * np.linalg.norm(rvw2freq))

This function takes two strings as input and returns the similarity between them on a scale of 0 to 1. My aim is to calculate the content similarity of each user, so I formed a groupby as follows:
grouped = df.groupby('user_id')['review']
Now I want to apply my ACS function to each group (something like grouped.ACS()). The problem is that ACS takes two strings as input and calculates their similarity, but each group in the groupby may contain more than two review strings. How can I apply this function to each group so that it takes all the reviews in the group and calculates their content similarity? Many thanks.

You can use pd.merge to get the Cartesian product of the rows within each group and then use pd.DataFrame.apply to apply your function to every pair:
import pandas as pd

# Helper: pair every review in a group with every review in the same group
# (Cartesian product via a constant dummy key) and compute ACS for each pair
def groupSimilarity(df):
    combinations = (df.assign(dummy=1)
                      .merge(df.assign(dummy=1), on="dummy")
                      .drop("dummy", axis=1))
    similarity = combinations.apply(lambda x: ACS(x["review_x"], x["review_y"]), axis=1)
    combinations.loc[:, "similarity"] = similarity
    return combinations

# Apply the helper to each user's group of reviews
grouped = (df.groupby("user_id")
             .apply(groupSimilarity)
             .reset_index())

# >>> grouped[["review_id_x", "review_id_y", "user_id_x", "user_id_y", "similarity"]]
#     review_id_x  review_id_y  user_id_x  user_id_y  similarity
#  0            0            0         10         10    1.000000
#  1            0            2         10         10    0.316228
#  2            2            0         10         10    0.316228
#  3            2            2         10         10    1.000000
#  4            1            1         30         30    1.000000
#  5            1            3         30         30    0.316228
#  6            1            4         30         30    0.000000
#  7            3            1         30         30    0.316228
#  8            3            3         30         30    1.000000
#  9            3            4         30         30    0.000000
# 10            4            1         30         30    0.000000
# 11            4            3         30         30    0.000000
# 12            4            4         30         30    1.000000
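
The table above lists every ordered pair, including each review paired with itself (similarity 1.0) and both orderings of each pair. If what you want in the end is a single average per user, one possible way to get it is to average over unordered pairs of distinct reviews only. The following is a minimal sketch under that assumption, not the only reasonable definition of "average content similarity". The helper name mean_pairwise_similarity is hypothetical, ACS is repeated in compact form (same logic as in the question) so the block runs on its own, and returning NaN for a user with a single review is a choice made here for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def ACS(rvw1, rvw2):
    """Cosine similarity of word counts (same logic as the question's ACS)."""
    for ch in ",.?!":
        rvw1, rvw2 = rvw1.replace(ch, ""), rvw2.replace(ch, "")
    w1, w2 = rvw1.lower().split(), rvw2.lower().split()
    vocab = sorted(set(w1) | set(w2))
    f1 = [w1.count(w) for w in vocab]
    f2 = [w2.count(w) for w in vocab]
    return np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))


def mean_pairwise_similarity(reviews):
    """Average ACS over all unordered pairs of distinct reviews in one group."""
    pairs = list(combinations(reviews, 2))
    if not pairs:  # assumption: a user with a single review gets NaN
        return np.nan
    return float(np.mean([ACS(a, b) for a, b in pairs]))


# Sample data copied from the question
df = pd.DataFrame({
    "review_id": [0, 1, 2, 3, 4],
    "user_id": [10, 30, 10, 30, 30],
    "prod_id": [5, 10, 15, 5, 10],
    "review": [
        "this restaurant is the best.",
        "Worst food.",
        "Best place!",
        "the food is too expensive.",
        "Yummy! I love it.",
    ],
})

# One value per user_id: the mean similarity over that user's review pairs
avg_similarity = df.groupby("user_id")["review"].apply(mean_pairwise_similarity)
print(avg_similarity)
```

On the sample data this gives about 0.316 for user 10 (one pair) and about 0.105 for user 30 (three pairs, two of which share no words). Skipping self-pairs matters: including them would pull every user's average toward 1 regardless of how similar their reviews actually are.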