Evaluating list similarity

Question

I have a dataframe whose columns contain item recommendations, each stored as a list (in reality every list has 10 elements, but that is not important here):

user_id     actual            predicted            popular             random
u1          [a,b,c]           [a,b,d]              [c,e,f]             [d,e,f]
u2          [a,b,d]           [a,b,c]              [c,e,f]             [a,b,c]
u3          [c,e,f]           [a,c,e]              [c,e,f]             [a,c,f]
u4          [c,e,f]           [a,e,f]              [c,e,f]             [a,d,f]
u5          [b,e,f]           [a,b,e]              [c,e,f]             [a,c,e]

While I have some separate statistics regarding predicted, I would like to compare how close the predicted, popular and random lists are to the actual lists. popular always contains the same three items.

I was thinking of calculating the percentage for each case and then averaging:

 user_id                predicted            popular             random
 u1                     0.66                 0.33                0
 u2                     0.66                 0                   0.66
 u3                     0.66                 1                   0.66
 u4                     0.66                 1                   0.33
 u5                     0.66                 0.66                0.33

Normally, I would do something like:

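# Illustrative inputs (u1's actual and predicted lists from the table above)
listA = ["a", "b", "c"]
listB = ["a", "b", "d"]

setA = set(listA)
setB = set(listB)

overlap = setA & setB   # items present in both lists
universe = setA | setB  # items present in either list (not used below)

# Percentage of listA's items that also appear in listB
result = float(len(overlap)) / len(setA) * 100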

But how can I do this for a large dataframe?

Answer 1

So, given the following dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "user_id": {0: "u1", 1: "u2", 2: "u3", 3: "u4", 4: "u5"},
        "actual": {
            0: ["a", "b", "c"],
            1: ["a", "b", "d"],
            2: ["c", "e", "f"],
            3: ["c", "e", "f"],
            4: ["b", "e", "f"],
        },
        "predicted": {
            0: ["a", "b", "d"],
            1: ["a", "b", "c"],
            2: ["a", "c", "e"],
            3: ["a", "e", "f"],
            4: ["a", "b", "e"],
        },
        "popular": {
            0: ["c", "e", "f"],
            1: ["c", "e", "f"],
            2: ["c", "e", "f"],
            3: ["c", "e", "f"],
            4: ["c", "e", "f"],
        },
        "random": {
            0: ["d", "e", "f"],
            1: ["a", "b", "c"],
            2: ["a", "c", "f"],
            3: ["a", "d", "f"],
            4: ["a", "c", "e"],
        },
    }
)

You could try this:

# Convert lists into sets
# (DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
df = df.applymap(lambda x: set(x) if isinstance(x, list) else x)

# Iterate to create new columns with percentages
for i in range(df.shape[0]):
    for col in ["predicted", "popular", "random"]:
        df.loc[i, f"{col}_pct"] = (
            len(df.loc[i, "actual"] & df.loc[i, col]) / len(df.loc[i, "actual"]) * 100
        )

# Cleanup
df = df[["user_id", "predicted_pct", "popular_pct", "random_pct"]]

And here is the expected result:

print(df)
# Outputs
  user_id  predicted_pct  popular_pct  random_pct
0      u1      66.666667    33.333333    0.000000
1      u2      66.666667     0.000000   66.666667
2      u3      66.666667   100.000000   66.666667
3      u4      66.666667   100.000000   33.333333
4      u5      66.666667    66.666667   33.333333
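
Since the question also mentions averaging and a large dataframe, here is a minimal sketch of one possible variation, not a definitive implementation. It avoids the per-cell df.loc writes by building each column in one pass, then takes the column means. The helper name overlap_pct and the scores / mean_scores variables are made up for illustration, and the sketch assumes every actual list is non-empty.

```python
import pandas as pd


def overlap_pct(actual, candidate):
    """Percentage of items in `actual` that also appear in `candidate`.

    Hypothetical helper; assumes `actual` is non-empty (otherwise this
    would divide by zero).
    """
    actual_set = set(actual)
    return len(actual_set & set(candidate)) / len(actual_set) * 100


# Same sample data as in the question
df = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3", "u4", "u5"],
        "actual": [list("abc"), list("abd"), list("cef"), list("cef"), list("bef")],
        "predicted": [list("abd"), list("abc"), list("ace"), list("aef"), list("abe")],
        "popular": [list("cef")] * 5,
        "random": [list("def"), list("abc"), list("acf"), list("adf"), list("ace")],
    }
)

candidates = ["predicted", "popular", "random"]

# Build each percentage column in one pass over the paired lists
scores = pd.DataFrame({"user_id": df["user_id"]})
for col in candidates:
    scores[f"{col}_pct"] = [
        overlap_pct(actual, other) for actual, other in zip(df["actual"], df[col])
    ]

# Average overlap per recommendation source
mean_scores = scores[[f"{col}_pct" for col in candidates]].mean()
print(mean_scores)
```

The cells hold Python lists, so the set intersection still runs once per row either way; the main saving here is that each result column is assigned once rather than cell by cell.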

Solution 1
1 2021-11-20 16:37:10