[英]How to select top K items from 2 pandas DataFrame based on conditions?
Assume, there are two DataFrame: visitor & group .假设,有两个 DataFrame: visitor & group 。 visitor stores each visitor information and which item s/he selected (likelihood values).
访客存储每个访客信息和他/她选择的项目(可能性值)。 However, not every item has been purchased by all visitors.
但是,并非所有访客都购买了每件商品。 group stores the certain items belong to which item-family information.
组存储某些项目属于哪个项目系列信息。 The following are the toy DataFrames
以下是玩具 DataFrames
import numpy as np
import pandas as pd
items = [11,12,13,14,
21,22,23,24,
2,7,9,10]
col_names = [2,7,9,10,11,13,14,21,24]
np.random.seed(123)
nums = np.round(np.random.random(size = (3,9)),2)
visitor = pd.DataFrame(nums, index = (100,101,102))
visitor.columns = col_names
group = pd.DataFrame({'item':sorted(items),
'family':sorted(['a1','a2','a3']*4)})
print(visitor)
2 7 9 10 11 13 14 21 24
100 0.70 0.29 0.23 0.55 0.72 0.42 0.98 0.68 0.48
101 0.39 0.34 0.73 0.44 0.06 0.40 0.74 0.18 0.18
102 0.53 0.53 0.63 0.85 0.72 0.61 0.72 0.32 0.36
print(group)
item family
0 2 a1
1 7 a1
2 9 a1
3 10 a1
4 11 a2
5 12 a2
6 13 a2
7 14 a2
8 21 a3
9 22 a3
10 23 a3
11 24 a3
The goal is to select top 2 items that are from DIFFERENT item-family based on the values.目标是select基于值的来自不同项目系列的前 2 个项目。 This is my code
这是我的代码
def Basket(df, x, num_items = 2):
keys = list(df)
values = df.loc[x]
item_dict = dict([(i, j) for i, j in zip(keys, values)])
output = list(dict(sorted(item_dict.items(), key=lambda kv: kv[1], reverse = True)))[:num_items]
return output
print(Basket(df = visitor, dx = 100))
[14, 11] # 14 & 11 from the same family: a2
print(Basket(df = visitor, x = 101))
[14, 9] # 14 & 9 from different families: a2 & a1
I am not sure how to incorporate the group df into my code to select top 2 items (based on the values and item-family information) from different family such as我不确定如何将组df 合并到我的代码中,以 select 来自不同系列的前 2 个项目(基于值和项目系列信息),例如
print(Basket(df1 = visitor, df2 = group, x = 100))
[14, 2]
print(Basket(df1 = visitor, df2 = group, x = 101))
[14, 9]
Note: 100, 101, and 102 represent visitor id (row index).注: 100、101、102代表访客id(行索引)。 any suggestion?
有什么建议吗? many thanks in advance
提前谢谢了
Try:尝试:
def basket(visitor, x, number_items=2):
return (visitor.loc[[x]].T # selecting visitor id and transposing
.merge(group,
left_index=True,
right_on='item') # merging with group dataframe
.sort_values(x, ascending=False) # sorting on values in group
.groupby('family') # creating family groups
.head(1) # selecting one item from each group
.head(number_items)['item'] # Getting top n items
.to_numpy()) # return numpy array
Output: Output:
basket(visitor, 100, 2)
# array([14, 2], dtype=int64)
basket(visitor, 101, 2)
# array([14, 9], dtype=int64)
You can merge your 2 dataframes before:您可以在之前合并您的 2 个数据框:
out = visitor.rename_axis('visitor').melt(var_name='item', ignore_index=False) \
.reset_index().merge(group, on='item')
out = out.loc[out.groupby(['visitor', 'family'])['value'].nlargest(2).index.levels[-1]] \
.sort_values(['visitor', 'family', 'value'], ascending=[True, True, False], ignore_index=True)
Output: Output:
>>> out
visitor item value family
0 100 2 0.70 a1
1 100 10 0.55 a1
2 100 14 0.98 a2
3 100 11 0.72 a2
4 100 21 0.68 a3
5 100 24 0.48 a3
6 101 9 0.73 a1
7 101 10 0.44 a1
8 101 14 0.74 a2
9 101 13 0.40 a2
10 101 21 0.18 a3
11 101 24 0.18 a3
12 102 10 0.85 a1
13 102 9 0.63 a1
14 102 11 0.72 a2
15 102 14 0.72 a2
16 102 24 0.36 a3
17 102 21 0.32 a3
Intermediate result after merge
: merge
后的中间结果:
>>> out
visitor item value family
0 100 2 0.70 a1
1 101 2 0.39 a1
2 102 2 0.53 a1
3 100 7 0.29 a1
4 101 7 0.34 a1
5 102 7 0.53 a1
6 100 9 0.23 a1
7 101 9 0.73 a1
8 102 9 0.63 a1
9 100 10 0.55 a1
10 101 10 0.44 a1
11 102 10 0.85 a1
12 100 11 0.72 a2
13 101 11 0.06 a2
14 102 11 0.72 a2
15 100 13 0.42 a2
16 101 13 0.40 a2
17 102 13 0.61 a2
18 100 14 0.98 a2
19 101 14 0.74 a2
20 102 14 0.72 a2
21 100 21 0.68 a3
22 101 21 0.18 a3
23 102 21 0.32 a3
24 100 24 0.48 a3
25 101 24 0.18 a3
26 102 24 0.36 a3
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.