Evaluating list similarities
I have a dataframe whose columns hold various item recommendations, with each cell stored as a list (in reality every list has 10 elements, but that is not important here):
user_id  actual   predicted  popular  random
u1       [a,b,c]  [a,b,d]    [c,e,f]  [d,e,f]
u2       [a,b,d]  [a,b,c]    [c,e,f]  [a,b,c]
u3       [c,e,f]  [a,c,e]    [c,e,f]  [a,c,f]
u4       [c,e,f]  [a,e,f]    [c,e,f]  [a,d,f]
u5       [b,e,f]  [a,b,e]    [c,e,f]  [a,c,e]
While I have some separate statistics for predicted, I would like to compare how close the predicted, popular and random lists are to the actual lists. popular always contains the same three items.
I was thinking of calculating, for each row, the fraction of actual items that each list contains, and then averaging:
user_id  predicted  popular  random
u1       0.66       0.33     0
u2       0.66       0        0.66
u3       0.66       1        0.66
u4       0.66       1        0.33
u5       0.66       0.66     0.33
Normally, for a single pair of lists, I would do something like:
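setA = set(listA)
setB = set(listB)
overlap = setA & setB   # items present in both lists
universe = setA | setB  # items present in either list (not used below)
result = len(overlap) / len(setA) * 100  # percentage of listA also found in listB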
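As a minimal sketch, the snippet above can be wrapped in a reusable function. The names overlap_pct and jaccard_pct are hypothetical and not part of the original post; the second function is only an assumption about what the unused universe set might be for (intersection over union), and the handling of empty lists is likewise an assumption.

```python
def overlap_pct(actual, other):
    """Percentage of the items in `actual` that also appear in `other`."""
    actual_set, other_set = set(actual), set(other)
    if not actual_set:
        return 0.0  # assumption: an empty reference list scores 0
    return len(actual_set & other_set) / len(actual_set) * 100


def jaccard_pct(actual, other):
    """Intersection over union as a percentage (a symmetric alternative)."""
    actual_set, other_set = set(actual), set(other)
    universe = actual_set | other_set
    if not universe:
        return 0.0  # assumption: two empty lists score 0
    return len(actual_set & other_set) / len(universe) * 100


# Row u1 of the example: actual [a, b, c] against predicted [a, b, d]
print(overlap_pct(["a", "b", "c"], ["a", "b", "d"]))  # 2 of 3 items match
print(jaccard_pct(["a", "b", "c"], ["a", "b", "d"]))  # 2 shared out of 4 distinct
```

The first measure is not symmetric: it divides by the size of the reference list only, which matches the percentages in the table when all lists have the same length.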
But how can I do this for a large dataframe?
So, given the following dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "user_id": {0: "u1", 1: "u2", 2: "u3", 3: "u4", 4: "u5"},
        "actual": {
            0: ["a", "b", "c"],
            1: ["a", "b", "d"],
            2: ["c", "e", "f"],
            3: ["c", "e", "f"],
            4: ["b", "e", "f"],
        },
        "predicted": {
            0: ["a", "b", "d"],
            1: ["a", "b", "c"],
            2: ["a", "c", "e"],
            3: ["a", "e", "f"],
            4: ["a", "b", "e"],
        },
        "popular": {
            0: ["c", "e", "f"],
            1: ["c", "e", "f"],
            2: ["c", "e", "f"],
            3: ["c", "e", "f"],
            4: ["c", "e", "f"],
        },
        "random": {
            0: ["d", "e", "f"],
            1: ["a", "b", "c"],
            2: ["a", "c", "f"],
            3: ["a", "d", "f"],
            4: ["a", "c", "e"],
        },
    }
)
You could try this:
# Convert lists into sets
# (applymap is deprecated since pandas 2.1; DataFrame.map is the newer name)
df = df.applymap(lambda x: set(x) if isinstance(x, list) else x)
# Iterate to create new columns with percentages
for i in range(df.shape[0]):
    for col in ["predicted", "popular", "random"]:
        df.loc[i, f"{col}_pct"] = (
            len(df.loc[i, "actual"] & df.loc[i, col]) / len(df.loc[i, "actual"]) * 100
        )
# Cleanup
df = df[["user_id", "predicted_pct", "popular_pct", "random_pct"]]
And here is the expected result:
print(df)
# Outputs
  user_id  predicted_pct  popular_pct  random_pct
0      u1      66.666667    33.333333    0.000000
1      u2      66.666667     0.000000   66.666667
2      u3      66.666667   100.000000   66.666667
3      u4      66.666667   100.000000   33.333333
4      u5      66.666667    66.666667   33.333333
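The question also asks about averaging and about larger dataframes. The following is only a sketch of one possible variation, not the method from the answer: it avoids cell-by-cell .loc assignment by building each column with a list comprehension over zip, then takes the column means. The helper overlap_pct and the names result and means are hypothetical, and the sample data is rebuilt here so the block runs on its own.

```python
import pandas as pd

# Same sample data as in the question, written compactly.
df = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3", "u4", "u5"],
        "actual": [list("abc"), list("abd"), list("cef"), list("cef"), list("bef")],
        "predicted": [list("abd"), list("abc"), list("ace"), list("aef"), list("abe")],
        "popular": [list("cef")] * 5,
        "random": [list("def"), list("abc"), list("acf"), list("adf"), list("ace")],
    }
)


def overlap_pct(actual, other):
    """Percentage of the items in `actual` that also appear in `other`."""
    actual_set = set(actual)
    return len(actual_set & set(other)) / len(actual_set) * 100


# Build each percentage column in one pass instead of assigning cell by cell.
result = df[["user_id"]].copy()
for col in ["predicted", "popular", "random"]:
    result[f"{col}_pct"] = [
        overlap_pct(a, o) for a, o in zip(df["actual"], df[col])
    ]

# Average overlap per recommendation strategy.
means = result.drop(columns="user_id").mean()
print(means)
```

On this sample the means come out at roughly 66.7 for predicted, 60 for popular and 40 for random. Since the cells hold Python lists, the work is still done row by row in Python; the gain over the nested loop is mainly from skipping repeated .loc lookups.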