Recommendations based on search keywords

Question

I have an input query table as follows:

    query
0  orange
1   apple
2    meat

which I want to match against the user query table below:

   user       query
0    a1      orange
1    a1  strawberry
2    a1        pear
3    a2      orange
4    a2  strawberry
5    a2       lemon
6    a3      orange
7    a3      banana
8    a6        meat
9    a7        beer
10   a8       juice

Given a query in input query, I want to find the users in the user query table who made the same query, collect the other queries those users made, and return the top 3 of those ranked by count.

For example, orange in input query matches users a1, a2 and a3 in user query, who have all queried orange. The other items they queried are strawberry (count of 2) and pear, lemon and banana (count of 1 each).

The answer is strawberry (it has the highest count), then pear and lemon (since we only return the top 3).
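The counting rule described above can be checked by hand for a single query. The following is a minimal sketch on the sample data from this question; `co_query_counts` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd

# Sample data copied from the user query table in the question
df_user = pd.DataFrame({
    'user': ['a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a3', 'a3', 'a6', 'a7', 'a8'],
    'query': ['orange', 'strawberry', 'pear', 'orange', 'strawberry', 'lemon',
              'orange', 'banana', 'meat', 'beer', 'juice'],
})

def co_query_counts(df, target):
    """Count the other queries made by users who queried `target` (hypothetical helper)."""
    # Users who made the target query at least once
    users = df.loc[df['query'] == target, 'user'].unique()
    # All other queries made by those users
    others = df[df['user'].isin(users) & (df['query'] != target)]
    # Count each (user, query) pair once, then tally per query
    return others.drop_duplicates()['query'].value_counts()

counts = co_query_counts(df_user, 'orange')
print(counts)
```

For orange this gives a count of 2 for strawberry and 1 each for pear, lemon and banana. Since three items tie at 1, which two of them fill the remaining top 3 slots depends on how ties are ordered.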

The same reasoning applies to apple (no user queried it, so the output is 'nothing') and to meat (its only user, a6, made no other queries).

So the final output table is:

    query   recommend
0  orange  strawberry
1  orange        pear
2  orange       lemon
3   apple     nothing
4    meat     nothing

What is an efficient way to do this, given that the user query table has 1 million rows?

Here is the code for the input query, user query and output tables:

import pandas as pd

df_input = pd.DataFrame( {'query': {0: 'orange', 1: 'apple', 2: 'meat'}} )
df_user = pd.DataFrame( {'user': {0: 'a1', 1: 'a1', 2: 'a1', 3: 'a2', 4: 'a2', 5: 'a2', 6: 'a3', 7: 'a3', 8: 'a6', 9: 'a7', 10: 'a8'}, 'query': {0: 'orange', 1: 'strawberry', 2: 'pear', 3: 'orange', 4: 'strawberry', 5: 'lemon', 6: 'orange', 7: 'banana', 8: 'meat', 9: 'beer', 10: 'juice'}} )
df_output = pd.DataFrame( {'query': {0: 'orange', 1: 'orange', 2: 'orange', 3: 'apple', 4: 'meat'}, 'recommend': {0: 'strawberry', 1: 'pear', 2: 'lemon', 3: 'nothing', 4: 'nothing'}} )

Answer 1

Depending on how much memory you have available, choose one of the following solutions.

Code:

# Preparation:

import pandas as pd

# Create sample dataframes
df_input = pd.DataFrame({'query': {0: 'orange', 1: 'apple', 2: 'meat'}})
df_user = pd.DataFrame({'user': {0: 'a1', 1: 'a1', 2: 'a1', 3: 'a2', 4: 'a2', 5: 'a2', 6: 'a3', 7: 'a3', 8: 'a6', 9: 'a7', 10: 'a8'}, 'query': {0: 'orange', 1: 'strawberry', 2: 'pear', 3: 'orange', 4: 'strawberry', 5: 'lemon', 6: 'orange', 7: 'banana', 8: 'meat', 9: 'beer', 10: 'juice'}})

# Define how many recommended items you need for each query
n_top = 3

# Exclude rows that are not needed for the calculation
dfu = df_user.drop_duplicates()
queries = df_input['query']
users = dfu.loc[dfu['query'].isin(queries), 'user'].drop_duplicates()
mask_q = dfu['query'].isin(queries)
mask_u = dfu['user'].isin(users)
df1 = dfu[mask_u&mask_q].set_index('user')
df2 = dfu[mask_u].set_index('user')

# Solution 1:

If you have plenty of memory, try the following code.

# Carry out the basket analysis
df = df1.join(df2, lsuffix='_x', rsuffix='_y')
df = df[df.query_x!=df.query_y].reset_index()
df = df.groupby(['query_x', 'query_y'], as_index=False).count()
df = df.sort_values('user', ascending=False).groupby('query_x').head(n_top)
df = df.drop('user', axis=1).rename(columns={'query_x': 'query', 'query_y': 'recommend'})
df = df_input.merge(df, how='left', on='query').fillna('nothing')
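One detail worth noting about the ranking step: pear, lemon and banana all tie at a count of 1, and sorting on the count alone does not say which of them come first (the default sort algorithm in `sort_values` is not guaranteed to be stable). The sketch below shows one way to make the result reproducible by adding a secondary sort key. Alphabetical tie-breaking is an assumption made here for illustration, not something the question specifies, and the `pairs` frame is typed in by hand to mirror the pair counts for the sample data.

```python
import pandas as pd

# Pair counts for the sample data, as they stand after the groupby/count step
pairs = pd.DataFrame({
    'query_x': ['orange', 'orange', 'orange', 'orange'],
    'query_y': ['banana', 'lemon', 'pear', 'strawberry'],
    'user': [1, 1, 1, 2],
})
n_top = 3

# Sort by count (descending), then by recommended query (ascending) to break ties
ranked = pairs.sort_values(['user', 'query_y'], ascending=[False, True])
top = ranked.groupby('query_x').head(n_top)
print(top['query_y'].tolist())
```

With this tie-break the top 3 for orange are strawberry, banana and lemon. A different secondary key (for example, first appearance in the user query table) would give a different but equally valid top 3.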

# Solution 2:

If memory is limited, try the following code. It takes much longer than Solution 1, but it is far more likely to run to completion.

# Carry out the basket analysis one pair of queries at a time
# Partial results are collected in lists and combined with pd.concat
# (DataFrame.append was removed in pandas 2.0)
parts = []
for _, df_q1 in df1.groupby('query'):
    pairs = []
    for _, df_q2 in df2.groupby('query'):
        df_q1q2 = df_q1.join(df_q2, lsuffix='_x', rsuffix='_y')
        df_q1q2 = df_q1q2.reset_index().groupby(['query_x', 'query_y'], as_index=False).count()
        pairs.append(df_q1q2)
    _df = pd.concat(pairs, ignore_index=True)
    _df = _df[_df.query_x != _df.query_y]
    _df = _df.sort_values('user', ascending=False).groupby('query_x').head(n_top)
    parts.append(_df)
df = pd.concat(parts, ignore_index=True)
df = df.drop('user', axis=1).rename(columns={'query_x': 'query', 'query_y': 'recommend'})
df = df_input.merge(df, how='left', on='query').fillna('nothing')
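As a possible alternative when memory is tight, the same co-occurrence counts can be built without any join, using plain dictionaries. This is a different technique from the two solutions above, sketched under assumptions: `recommend` is a hypothetical function name, ties are broken alphabetically, and its speed on 1 million rows has not been measured here.

```python
from collections import Counter, defaultdict

def recommend(input_queries, user_rows, n_top=3):
    """Return (query, recommendation) pairs; hypothetical helper, not from the answer."""
    # Build the set of distinct queries per user in one pass
    by_user = defaultdict(set)
    for user, query in user_rows:
        by_user[user].add(query)

    wanted = set(input_queries)
    counts = {q: Counter() for q in wanted}
    for queries in by_user.values():
        for q in queries & wanted:
            # Every other query by this user co-occurs with q once
            counts[q].update(queries - {q})

    rows = []
    for q in input_queries:
        # Sort by count descending, then alphabetically, so ties are deterministic
        best = sorted(counts[q].items(), key=lambda kv: (-kv[1], kv[0]))[:n_top]
        if best:
            rows.extend((q, rec) for rec, _ in best)
        else:
            rows.append((q, 'nothing'))
    return rows

# Sample data copied from the question
user_rows = [
    ('a1', 'orange'), ('a1', 'strawberry'), ('a1', 'pear'),
    ('a2', 'orange'), ('a2', 'strawberry'), ('a2', 'lemon'),
    ('a3', 'orange'), ('a3', 'banana'),
    ('a6', 'meat'), ('a7', 'beer'), ('a8', 'juice'),
]
result = recommend(['orange', 'apple', 'meat'], user_rows)
print(result)
```

With a DataFrame, the rows could be passed as `df_user[['user', 'query']].itertuples(index=False)`. The structure holds one set of queries per user plus one counter per input query, so the intermediate join table from Solution 1 is never materialized.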

Output of both Solution 1 and Solution 2:

    query   recommend
0  orange  strawberry
1  orange      banana
2  orange       lemon
3   apple     nothing
4    meat     nothing


Answer 1: accepted, 2022-02-15 00:56:18
