Evaluating list similarities

I have a dataframe that contains columns of various item recommendations, where each cell holds a list (in reality all lists have 10 elements in them, but this is less important):

user_id     actual            predicted            popular             random
u1          [a,b,c]           [a,b,d]              [c,e,f]             [d,e,f]
u2          [a,b,d]           [a,b,c]              [c,e,f]             [a,b,c]
u3          [c,e,f]           [a,c,e]              [c,e,f]             [a,c,f]
u4          [c,e,f]           [a,e,f]              [c,e,f]             [a,d,f]
u5          [b,e,f]           [a,b,e]              [c,e,f]             [a,c,e]  

While I have some separate statistics regarding predicted, I would like to compare how close the predicted, popular and random lists are to the actual lists. popular always has the same three items.

I was thinking of calculating the percentages for each case and then averaging:

 user_id                predicted            popular             random
 u1                     0.66                 0.33                0
 u2                     0.66                 0                   0.66
 u3                     0.66                 1                   0.66
 u4                     0.66                 1                   0.33
 u5                     0.66                 0.66                0.33

Normally, I would do something like:

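setA = set(listA)
setB = set(listB)

overlap = setA & setB   # items present in both lists
universe = setA | setB  # items present in either list (not used below)

# Share of listA's items that also appear in listB, as a percentage
result = float(len(overlap)) / len(setA) * 100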
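For reference, the snippet above can be wrapped in small functions. The sketch below is only an illustration: the names overlap_share and jaccard are made up here, and the Jaccard variant is an assumption about what the unused universe set might have been intended for, not something stated in the question. It also assumes the first list is never empty.

```python
def overlap_share(list_a, list_b):
    """Fraction of the distinct items in list_a that also appear in list_b."""
    set_a, set_b = set(list_a), set(list_b)
    return len(set_a & set_b) / len(set_a)


def jaccard(list_a, list_b):
    """Intersection over union of the two lists' distinct items."""
    set_a, set_b = set(list_a), set(list_b)
    return len(set_a & set_b) / len(set_a | set_b)


# Row u1 from the table above
actual = ["a", "b", "c"]
predicted = ["a", "b", "d"]

share = overlap_share(actual, predicted)  # 2 shared items out of 3 in actual
jac = jaccard(actual, predicted)          # 2 shared items out of 4 distinct items
print(share, jac)
```

The first measure is asymmetric (it is relative to the actual list), which matches the percentages in the table; the Jaccard index is symmetric and penalizes extra items on either side.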

But how can I do this for a large dataframe?

So, given the following dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "user_id": {0: "u1", 1: "u2", 2: "u3", 3: "u4", 4: "u5"},
        "actual": {
            0: ["a", "b", "c"],
            1: ["a", "b", "d"],
            2: ["c", "e", "f"],
            3: ["c", "e", "f"],
            4: ["b", "e", "f"],
        },
        "predicted": {
            0: ["a", "b", "d"],
            1: ["a", "b", "c"],
            2: ["a", "c", "e"],
            3: ["a", "e", "f"],
            4: ["a", "b", "e"],
        },
        "popular": {
            0: ["c", "e", "f"],
            1: ["c", "e", "f"],
            2: ["c", "e", "f"],
            3: ["c", "e", "f"],
            4: ["c", "e", "f"],
        },
        "random": {
            0: ["d", "e", "f"],
            1: ["a", "b", "c"],
            2: ["a", "c", "f"],
            3: ["a", "d", "f"],
            4: ["a", "c", "e"],
        },
    }
)

You could try this:

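# Convert lists into sets
# (applymap is deprecated since pandas 2.1 in favor of DataFrame.map, which takes the same function)
df = df.applymap(lambda x: set(x) if isinstance(x, list) else x)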

# Iterate to create new columns with percentages
for i in range(df.shape[0]):
    for col in ["predicted", "popular", "random"]:
        df.loc[i, f"{col}_pct"] = (
            len(df.loc[i, "actual"] & df.loc[i, col]) / len(df.loc[i, "actual"]) * 100
        )

# Cleanup
df = df[["user_id", "predicted_pct", "popular_pct", "random_pct"]]

And here is the expected result:

print(df)
# Outputs
  user_id  predicted_pct  popular_pct  random_pct
0      u1      66.666667    33.333333    0.000000
1      u2      66.666667     0.000000   66.666667
2      u3      66.666667   100.000000   66.666667
3      u4      66.666667   100.000000   33.333333
4      u5      66.666667    66.666667   33.333333
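
Since the question mentions a large dataframe and averaging the percentages, here is a possible variant, offered as a sketch rather than a benchmarked solution. It builds each percentage column in one pass with zip instead of repeated df.loc lookups, then takes the column means. The helper name overlap_pct and the result and means variables are made up for illustration, and the code assumes every actual list is non-empty.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3", "u4", "u5"],
        "actual": [list("abc"), list("abd"), list("cef"), list("cef"), list("bef")],
        "predicted": [list("abd"), list("abc"), list("ace"), list("aef"), list("abe")],
        "popular": [list("cef")] * 5,
        "random": [list("def"), list("abc"), list("acf"), list("adf"), list("ace")],
    }
)


def overlap_pct(actual, other):
    """Percentage of the distinct items in `actual` that also appear in `other`."""
    actual_set = set(actual)
    return len(actual_set & set(other)) / len(actual_set) * 100


candidates = ["predicted", "popular", "random"]
result = df[["user_id"]].copy()
for col in candidates:
    # Pair each actual list with the candidate list from the same row
    result[f"{col}_pct"] = [
        overlap_pct(actual, other) for actual, other in zip(df["actual"], df[col])
    ]

# Average percentage per recommendation strategy
means = result[[f"{col}_pct" for col in candidates]].mean()

print(result)
print(means)
```

On this sample the means come out to roughly 66.7 for predicted, 60 for popular and 40 for random, which gives a single number per strategy to compare.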
