How to select the top K items from 2 pandas DataFrames based on a condition?

Question

Assume there are two DataFrames: visitor and group. visitor stores each visitor's information and which items he or she selected (as likelihood values). However, not every item has been purchased by all visitors. group stores which item-family each item belongs to. The following are the toy DataFrames:

import numpy as np 
import pandas as pd
items = [11,12,13,14, 
         21,22,23,24,
         2,7,9,10]
col_names = [2,7,9,10,11,13,14,21,24]
np.random.seed(123)
nums = np.round(np.random.random(size = (3,9)),2)

visitor = pd.DataFrame(nums, index = (100,101,102))
visitor.columns = col_names

group = pd.DataFrame({'item':sorted(items),
                      'family':sorted(['a1','a2','a3']*4)})

print(visitor)

       2     7     9     10    11    13    14    21    24
100  0.70  0.29  0.23  0.55  0.72  0.42  0.98  0.68  0.48
101  0.39  0.34  0.73  0.44  0.06  0.40  0.74  0.18  0.18
102  0.53  0.53  0.63  0.85  0.72  0.61  0.72  0.32  0.36

print(group)

    item family
0      2     a1
1      7     a1
2      9     a1
3     10     a1
4     11     a2
5     12     a2
6     13     a2
7     14     a2
8     21     a3
9     22     a3
10    23     a3
11    24     a3

The goal is to select the top 2 items, based on the values, that come from DIFFERENT item-families. This is my code:

def Basket(df, x, num_items = 2):
    keys = list(df)   
    values = df.loc[x]   
    item_dict = dict([(i, j) for i, j in zip(keys, values)])
    output = list(dict(sorted(item_dict.items(), key=lambda kv: kv[1], reverse = True)))[:num_items]
    return output

print(Basket(df = visitor, x = 100))
[14, 11]  # 14 & 11 from the same family: a2

print(Basket(df = visitor, x = 101))
[14, 9] # 14 & 9 from different families: a2 & a1

I am not sure how to incorporate the group df into my code so that it selects the top 2 items from different families (based on the values and the item-family information), such as:

print(Basket(df1 = visitor, df2 = group, x = 100))
[14, 2]

print(Basket(df1 = visitor, df2 = group, x = 101))
[14, 9]

Note: 100, 101, and 102 are visitor ids (the row index). Any suggestions? Many thanks in advance.

Answer 1

Try:

def basket(visitor, x, number_items=2):
    return (visitor.loc[[x]].T                     # selecting visitor id and transposing 
                  .merge(group, 
                         left_index=True, 
                         right_on='item')          # merging with group dataframe 
                  .sort_values(x, ascending=False) # sorting by the visitor's values, highest first
                  .groupby('family')               # creating family groups
                  .head(1)                         # selecting one item from each group
                  .head(number_items)['item']      # Getting top n items
                  .to_numpy())                     # return numpy array

Output:

basket(visitor, 100, 2)
# array([14,  2], dtype=int64)

basket(visitor, 101, 2)
# array([14,  9], dtype=int64)
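Note that basket above reads group from the enclosing scope rather than taking it as an argument. A minimal sketch of a variant that passes both frames explicitly, closer to the call signature asked for in the question, might look like the following. The name basket_by_family is hypothetical, the toy data is copied from the printed frames in the question, and the approach (sort the visitor's row, then keep the first item seen per family) is only one way to do it, not the answer's own method.

```python
import pandas as pd

# Toy data copied from the frames printed in the question.
visitor = pd.DataFrame(
    [[0.70, 0.29, 0.23, 0.55, 0.72, 0.42, 0.98, 0.68, 0.48],
     [0.39, 0.34, 0.73, 0.44, 0.06, 0.40, 0.74, 0.18, 0.18],
     [0.53, 0.53, 0.63, 0.85, 0.72, 0.61, 0.72, 0.32, 0.36]],
    index=[100, 101, 102],
    columns=[2, 7, 9, 10, 11, 13, 14, 21, 24],
)
group = pd.DataFrame({
    "item": [2, 7, 9, 10, 11, 12, 13, 14, 21, 22, 23, 24],
    "family": ["a1"] * 4 + ["a2"] * 4 + ["a3"] * 4,
})


def basket_by_family(visitor, group, x, num_items=2):
    """Hypothetical helper: top items for visitor x, at most one per family."""
    # Item -> family lookup built from the group frame.
    family_of = group.set_index("item")["family"]
    # This visitor's values, highest first (mergesort is a stable sort).
    scores = visitor.loc[x].sort_values(ascending=False, kind="mergesort")
    # Family label of each item, in the same (sorted) order.
    families = scores.index.map(family_of)
    # Keep only the first, i.e. best-scoring, item seen for each family.
    best_per_family = scores[~families.duplicated()]
    return best_per_family.index[:num_items].tolist()


print(basket_by_family(visitor, group, x=100))
print(basket_by_family(visitor, group, x=101))
```

For visitors 100 and 101 this gives the same items as the output above. When two items of one family tie on value, which of them is kept depends on the sort order, so ties may need an explicit rule.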

Answer 2

You can merge your two DataFrames first:

out = visitor.rename_axis('visitor').melt(var_name='item', ignore_index=False) \
             .reset_index().merge(group, on='item')

out = out.loc[out.groupby(['visitor', 'family'])['value'].nlargest(2).index.levels[-1]] \
         .sort_values(['visitor', 'family', 'value'], ascending=[True, True, False], ignore_index=True)

Output:

>>> out
    visitor  item  value family
0       100     2   0.70     a1
1       100    10   0.55     a1
2       100    14   0.98     a2
3       100    11   0.72     a2
4       100    21   0.68     a3
5       100    24   0.48     a3
6       101     9   0.73     a1
7       101    10   0.44     a1
8       101    14   0.74     a2
9       101    13   0.40     a2
10      101    21   0.18     a3
11      101    24   0.18     a3
12      102    10   0.85     a1
13      102     9   0.63     a1
14      102    11   0.72     a2
15      102    14   0.72     a2
16      102    24   0.36     a3
17      102    21   0.32     a3

Intermediate result after merge:

>>> out
    visitor  item  value family
0       100     2   0.70     a1
1       101     2   0.39     a1
2       102     2   0.53     a1
3       100     7   0.29     a1
4       101     7   0.34     a1
5       102     7   0.53     a1
6       100     9   0.23     a1
7       101     9   0.73     a1
8       102     9   0.63     a1
9       100    10   0.55     a1
10      101    10   0.44     a1
11      102    10   0.85     a1
12      100    11   0.72     a2
13      101    11   0.06     a2
14      102    11   0.72     a2
15      100    13   0.42     a2
16      101    13   0.40     a2
17      102    13   0.61     a2
18      100    14   0.98     a2
19      101    14   0.74     a2
20      102    14   0.72     a2
21      100    21   0.68     a3
22      101    21   0.18     a3
23      102    21   0.32     a3
24      100    24   0.48     a3
25      101    24   0.18     a3
26      102    24   0.36     a3
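Note that the final result above keeps the two best items within every family. If the aim is instead the question's top K items from different families, one possible follow-up on the same merged long-format frame is sketched below. The names long_df, top_k, and baskets are hypothetical, and the toy data is copied from the printed frames in the question; treat this as a sketch under those assumptions rather than the answer's intended continuation.

```python
import pandas as pd

# Toy data copied from the frames printed in the question.
visitor = pd.DataFrame(
    [[0.70, 0.29, 0.23, 0.55, 0.72, 0.42, 0.98, 0.68, 0.48],
     [0.39, 0.34, 0.73, 0.44, 0.06, 0.40, 0.74, 0.18, 0.18],
     [0.53, 0.53, 0.63, 0.85, 0.72, 0.61, 0.72, 0.32, 0.36]],
    index=[100, 101, 102],
    columns=[2, 7, 9, 10, 11, 13, 14, 21, 24],
)
group = pd.DataFrame({
    "item": [2, 7, 9, 10, 11, 12, 13, 14, 21, 22, 23, 24],
    "family": ["a1"] * 4 + ["a2"] * 4 + ["a3"] * 4,
})

# Long format: one row per (visitor, item) pair with its family attached.
long_df = (visitor.rename_axis("visitor")
                  .melt(var_name="item", ignore_index=False)
                  .reset_index()
                  .merge(group, on="item"))

top_k = 2  # hypothetical parameter: number of items to keep per visitor

# Order each visitor's rows by value, highest first.
ranked = long_df.sort_values(["visitor", "value"], ascending=[True, False])
# Keep the best item of every (visitor, family) pair ...
best = ranked.drop_duplicates(["visitor", "family"])
# ... then the top_k best families per visitor, collected into lists.
baskets = best.groupby("visitor").head(top_k).groupby("visitor")["item"].agg(list)

print(baskets)
```

This handles every visitor in one pass instead of calling a function per visitor id. For visitor 102, items 11 and 14 tie at 0.72 within family a2, so which of the two is kept depends on the sort's tie handling.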

Answer 1
3, accepted, 2022-03-03 14:59:01

Answer 2
1, 2022-03-03 14:59:09
