How to perform a groupby operation on a pandas DataFrame that averages a column of lists?

Question

I have a pandas DataFrame like the one below:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

Now I want to perform a groupby operation on columns A and B and get the average of the lists in column C. The average of multiple lists is defined element-wise: the first element of the result is the average of the first elements of all lists, the second element is the average of all second elements, and so on.

The desired output for this example looks like this:

A        B            C
1        apple        [4, 7.5, 2]
1        banana       [5, 37, 1]
2        pineapple    [2.5, 12, 1.5]

(The lists within each group are always guaranteed to have the same length.)
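As a quick check of the element-wise definition, here is a minimal sketch that averages the two lists of the (A=1, B="apple") group by hand with NumPy. The variable names are only illustrative.

```python
import numpy as np

# Lists of column C for the group (A=1, B="apple") from the example.
apple_lists = [[6, 5, 2], [2, 10, 2]]

# Stack the lists as rows of a 2-D array and average down the rows (axis=0),
# so each output position is the mean of that position across all lists.
apple_mean = np.array(apple_lists).mean(axis=0).tolist()
print(apple_mean)  # [4.0, 7.5, 2.0]
```

This matches the first row of the desired output.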

How can I solve this?

I generally know how to perform groupby operations, either as list aggregations or as averages, but I could not find how to do this when averaging across multiple lists. If a groupby operation is not the most efficient solution, I'm also open to other suggestions.

Answer 1

Approach 1

Here, we create a new dataframe from the lists contained in column C and set the index of this newly created dataframe to columns A and B. Then we aggregate this frame by taking the mean over the levels present in the index.

Then, using .values + tolist, we take the mean values as a NumPy array, convert it to a list of lists, and assign it to column C.

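s = df.set_index(['A', 'B'])
# mean(level=...) was removed in pandas 2.0; group on the index levels instead
out = pd.DataFrame(list(s['C']), s.index).groupby(level=[0, 1]).mean()
# drop() no longer accepts axis positionally in pandas 2.0+, so name the columns explicitly
out.drop(columns=out.columns.tolist()).assign(C=out.values.tolist()).reset_index()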
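To make the intermediate frames easier to inspect, the same idea can be broken into named steps. This is a sketch, assuming pandas 2.x, where the per-level mean is written as groupby(level=...).mean(); the names wide and means are illustrative and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

# Step 1: move the grouping keys into the index.
s = df.set_index(["A", "B"])

# Step 2: expand each list into one row of a wide frame (columns 0, 1, 2).
wide = pd.DataFrame(s["C"].tolist(), index=s.index)

# Step 3: average each column within every (A, B) group.
means = wide.groupby(level=["A", "B"]).mean()

# Step 4: collapse each row of means back into a single list-valued column.
out = pd.DataFrame({"C": means.values.tolist()}, index=means.index).reset_index()
print(out)
```

Printing wide after step 2 shows one numeric column per list position, which is what lets pandas use its fast built-in mean instead of a Python-level function per group.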

Approach 2

A naive approach that can be slower on large dataframes. Here we group the dataframe by columns A and B and apply a lambda function to column C; the lambda creates a NumPy array from the lists and takes the mean along axis=0.

out = df.groupby(['A', 'B'])['C'].apply(
         lambda s: np.array(list(s)).mean(axis=0)).reset_index()
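To see what the lambda receives for each group, one can iterate over the grouped column directly. The loop below is only an illustrative sketch (the dictionary group_means is a hypothetical name); it does by hand what apply does for every group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

group_means = {}
# Each iteration yields the (A, B) key and the sub-Series of lists for that group.
for key, lists in df.groupby(["A", "B"])["C"]:
    stacked = np.array(list(lists))                 # shape: (rows in group, list length)
    group_means[key] = stacked.mean(axis=0).tolist()  # element-wise mean of the group

for key, value in group_means.items():
    print(key, value)
```

Because this runs one Python-level call per group, the cost grows with the number of groups, which is consistent with the answer's remark that this approach can be slow on large dataframes.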

Result

   A          B                 C
0  1      apple   [4.0, 7.5, 2.0]
1  1     banana  [5.0, 37.0, 1.0]
2  2  pineapple  [2.5, 12.0, 1.5]

Performance Profiling

On a sample dataframe with 50000 rows and 30000 unique groups:

df = pd.concat([df.assign(B=df['B'] + str(i))
               for i in range(10000)], ignore_index=True)


%%timeit
s = df.set_index(['A', 'B'])
out = pd.DataFrame(list(s['C']), s.index).groupby(level=[0, 1]).mean()
_ = out.drop(columns=out.columns.tolist()).assign(C=out.values.tolist()).reset_index()
# 173 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%%timeit
_ = df.groupby(['A', 'B'])['C'].apply(lambda s: np.array(list(s)).mean(axis=0)).reset_index()
# 2.24 s ± 68.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
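Outside IPython, where %%timeit is not available, a comparison along the same lines can be sketched with the standard timeit module. The function names approach_1 and approach_2 are illustrative, and the frame is built with only 200 copies (an assumption, to keep the run short), so the absolute numbers will differ from those reported above.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                     "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                     "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

# Same construction as in the answer, but with fewer copies (assumption).
big = pd.concat([base.assign(B=base["B"] + str(i)) for i in range(200)],
                ignore_index=True)

def approach_1(frame):
    # Expand the lists into columns, then use the built-in grouped mean.
    s = frame.set_index(["A", "B"])
    means = pd.DataFrame(s["C"].tolist(), index=s.index).groupby(level=[0, 1]).mean()
    return means.values.tolist()

def approach_2(frame):
    # Call a Python function once per group.
    res = frame.groupby(["A", "B"])["C"].apply(
        lambda s: np.array(list(s)).mean(axis=0))
    return [arr.tolist() for arr in res]

# Take the best of three single runs for each approach.
t1 = min(timeit.repeat(lambda: approach_1(big), number=1, repeat=3))
t2 = min(timeit.repeat(lambda: approach_2(big), number=1, repeat=3))
print(f"approach 1: {t1:.4f} s, approach 2: {t2:.4f} s")
```

Both functions return the group means in the same sorted key order, so their results can also be compared directly to confirm the two approaches agree.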

Answer 2

Try:

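# axis must be passed by keyword to pd.concat in pandas 2.0+;
# name='C' labels the resulting list column (it would otherwise be named 0)
df = pd.concat([df.pop('C').apply(pd.Series), df], axis=1).groupby(
    ['A', 'B']).mean().apply(list, axis=1).reset_index(name='C')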
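The one-liner above packs several steps together. A step-by-step sketch of the same idea, with illustrative names (expanded, wide, means) and keyword arguments as expected by recent pandas versions, might look like this:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

# Turn every list into a row of a new frame with columns 0, 1, 2.
expanded = df["C"].apply(pd.Series)

# Put the expanded columns next to the grouping keys.
wide = pd.concat([expanded, df[["A", "B"]]], axis=1)

# Average the expanded columns within each (A, B) group.
means = wide.groupby(["A", "B"]).mean()

# Collapse each row of means back into a list and restore A and B as columns.
out = means.apply(list, axis=1).reset_index(name="C")
print(out)
```

Unlike the one-liner, this version leaves the original df untouched, because it reads column C instead of removing it with pop.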

or:

df = df.T.apply(pd.Series.explode).T.convert_dtypes().groupby(
    ['A', 'B']).mean().apply(list, 1).reset_index()

Answer 3

Try this:

df = df.groupby(['A','B'])['C'].agg(list).reset_index()
df['C'] = df['C'].apply(lambda x: np.mean(x, axis=0))

Output

    A   B         C
0   1   apple     [4.0, 7.5, 2.0]
1   1   banana    [5.0, 37.0, 1.0]
2   2   pineapple [2.5, 12.0, 1.5]
