Average vector between two pandas DataFrames

Question

Assume there are two DataFrames, defined as follows:

import pandas as pd
import numpy as np 

df1 = pd.DataFrame({'item':['apple', 'orange', 'melon',
                            'meat', 'milk', 'soda', 'wine'],
                    'vector':[[12, 31, 45], [21, 14, 56], 
                              [9, 47, 3], [20, 7, 98], 
                              [11, 67, 5], [23, 45, 3],
                              [8, 9, 33]]})

df2 = pd.DataFrame({'customer':[1,2,3],
                    'grocery':[['apple', 'soda', 'wine'],
                               ['meat', 'orange'],
                               ['coffee', 'meat', 'milk', 'orange']]})

Printing df1 and df2 gives:

df1
    item    vector
0   apple   [12, 31, 45]
1   orange  [21, 14, 56]
2   melon   [9, 47, 3]
3   meat    [20, 7, 98]
4   milk    [11, 67, 5]
5   soda    [23, 45, 3]
6   wine    [8, 9, 33]

df2
    customer    grocery
0   1           [apple, soda, wine]
1   2           [meat, orange]
2   3           [coffee, meat, milk, orange]

The goal is to average the vectors of the items on each customer's grocery list. If an item is not listed in df1, represent it with [0, 0, 0]; thus 'coffee' = [0, 0, 0]. The final DataFrame df2 should look like this:

    customer    grocery                         average
0   1           [apple, soda, wine]             [14.33, 28.33, 27]
1   2           [meat, orange]                  [20.5, 10.5, 77]
2   3           [coffee, meat, milk, orange]    [13, 22, 39.75]

Here, customer 1's result is the average of the vectors for apple, soda, and wine. Customer 3's result is the average of the vectors for coffee, meat, milk, and orange, where again coffee = [0, 0, 0] because it is not in df1. Any suggestions? Many thanks in advance.

Answer 1

This answer may be long-winded and not optimized, but it will serve your purpose.

First, check whether each item in df2 is in df1, so that any missing item can be added to df1 with a zero vector.

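import itertools

# Collect every distinct item across all grocery lists
for i in set(itertools.chain.from_iterable(df2['grocery'])):
    if i not in list(df1['item']):
        # Append the missing item to df1 with a zero vector
        df1.loc[len(df1.index)] = [i,[0,0,0]]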
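
The membership check above can also be written as a set difference, with all missing rows appended in one step. This is only a sketch under the assumption that every vector has length 3; the names `missing` and `zeros` are illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'item': ['apple', 'orange', 'melon',
                             'meat', 'milk', 'soda', 'wine'],
                    'vector': [[12, 31, 45], [21, 14, 56],
                               [9, 47, 3], [20, 7, 98],
                               [11, 67, 5], [23, 45, 3],
                               [8, 9, 33]]})

df2 = pd.DataFrame({'customer': [1, 2, 3],
                    'grocery': [['apple', 'soda', 'wine'],
                                ['meat', 'orange'],
                                ['coffee', 'meat', 'milk', 'orange']]})

# Items that appear in any grocery list but not in df1
missing = sorted(set(df2['grocery'].explode()) - set(df1['item']))

# One zero-vector row per missing item, appended with a single concat
zeros = pd.DataFrame({'item': missing,
                      'vector': [[0, 0, 0] for _ in missing]})
df1 = pd.concat([df1, zeros], ignore_index=True)
```

With the sample data, only 'coffee' is missing, so df1 gains a single extra row.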

Next, use a list comprehension to compute the average vector for each grocery list and store it in a new column of df2.

df2['average'] = [np.mean(list(df1.loc[df1['item'].isin(i)]["vector"]),axis=0) for i in df2["grocery"]]

df2
Out[91]: 
   customer  ...                                         average
0         1  ...  [14.333333333333334, 28.333333333333332, 27.0]
1         2  ...                              [20.5, 10.5, 77.0]
2         3  ...                             [13.0, 22.0, 39.75]

[3 rows x 3 columns]
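
Two things are worth knowing about the approach above: it modifies df1 in place, and `isin` selects each matching row once, so an item repeated within one grocery list would be counted only once. If either matters, a plain dict lookup with a zero default is one possible alternative. The sketch below assumes all vectors have length 3; `lookup`, `zero`, and `average_vector` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'item': ['apple', 'orange', 'melon',
                             'meat', 'milk', 'soda', 'wine'],
                    'vector': [[12, 31, 45], [21, 14, 56],
                               [9, 47, 3], [20, 7, 98],
                               [11, 67, 5], [23, 45, 3],
                               [8, 9, 33]]})

df2 = pd.DataFrame({'customer': [1, 2, 3],
                    'grocery': [['apple', 'soda', 'wine'],
                                ['meat', 'orange'],
                                ['coffee', 'meat', 'milk', 'orange']]})

# Map each known item to its vector; df1 itself is left untouched
lookup = dict(zip(df1['item'], df1['vector']))
zero = [0, 0, 0]  # assumed vector length of 3


def average_vector(items):
    """Element-wise mean of the vectors for `items` (hypothetical helper)."""
    # Unknown items fall back to the zero vector and still count in the mean
    return np.mean([lookup.get(i, zero) for i in items], axis=0).tolist()


df2['average'] = [average_vector(items) for items in df2['grocery']]
```

Because every entry of the list contributes one vector, repeated items are averaged as many times as they appear.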

Answer 2

Can you check if this works? I'll add an explanation if it works.

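d2 = df2.explode('grocery')
lookup = df1.set_index('item')['vector']
vecs = d2['grocery'].map(lambda g: lookup.get(g, [0, 0, 0]))  # zeros for items missing from df1
df2['average'] = vecs.groupby(level=0).agg(lambda s: np.mean(s.tolist(), axis=0).round(2).tolist())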
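
For an explode-based approach to return element-wise averages, each vector has to stay a vector until the per-customer mean is taken. One way to do that is to expand the vectors into numeric columns first, so that missing items can be filled with 0 and the groupby can take an ordinary mean. The following is a sketch under the assumption that all vectors have the same length; `vec_cols`, `exploded`, `looked_up`, and `avg` are illustrative names, not from the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'item': ['apple', 'orange', 'melon',
                             'meat', 'milk', 'soda', 'wine'],
                    'vector': [[12, 31, 45], [21, 14, 56],
                               [9, 47, 3], [20, 7, 98],
                               [11, 67, 5], [23, 45, 3],
                               [8, 9, 33]]})

df2 = pd.DataFrame({'customer': [1, 2, 3],
                    'grocery': [['apple', 'soda', 'wine'],
                                ['meat', 'orange'],
                                ['coffee', 'meat', 'milk', 'orange']]})

# Expand each vector into numeric columns v0, v1, v2, indexed by item
vec_cols = pd.DataFrame(df1['vector'].tolist(),
                        index=df1['item']).add_prefix('v')

# One row per (customer, item); explode keeps the original df2 index
exploded = df2.explode('grocery')

# Look up vectors by item; items absent from df1 become rows of zeros
looked_up = vec_cols.reindex(exploded['grocery']).fillna(0)
looked_up.index = exploded.index

# Element-wise mean per original df2 row, packed back into lists
avg = looked_up.groupby(level=0).mean()
df2['average'] = avg.round(2).values.tolist()
```

The `reindex` step relies on the item names in df1 being unique, which holds for the sample data.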


Answer 1: score 1, accepted, 2022-06-10 02:12:14
Answer 2: score -1, 2022-06-10 01:35:35
