How to perform a groupby operation on a pandas DataFrame where the average over a list column is taken?

I have a pandas DataFrame like the one below:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

Now I want to perform a groupby operation on columns A and B and get the average of the lists in column C. The average of multiple lists is defined as the element-wise average: the first element of the result is the average of the first elements of all lists, the second element is the average of all second elements, and so on.

The desired output for this example looks like this:

A        B            C
1        apple        [4, 7.5, 2]
1        banana       [5, 37, 1]
2        pineapple    [2.5, 12, 1.5]

(It is always guaranteed that the lists for each group have the same length.)
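As a quick check of the definition, the element-wise average for the group A=1, B="apple" can be worked out directly with NumPy. This is only a sketch of the arithmetic, not a solution to the grouping itself; the variable names are illustrative.

```python
import numpy as np

# The two lists that belong to the group A=1, B="apple"
apple_lists = [[6, 5, 2], [2, 10, 2]]

# Stack the lists into a 2-D array and average down the rows (axis=0),
# so position i of the result is the mean of all i-th elements
apple_mean = np.array(apple_lists).mean(axis=0).tolist()
print(apple_mean)  # [4.0, 7.5, 2.0]
```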

How can I solve this?

I usually know how to perform groupby operations, either as list aggregations or as averages, but I could not find out how to do this when averaging over multiple lists. If a groupby operation is not the most efficient solution, I'm also open to other suggestions.

Approach 1

Here, we create a new dataframe from the lists contained in column C and set the index of this newly created dataframe to columns A and B. Then we aggregate this frame by taking the mean over the levels present in the index.

Then we use .values and tolist to take the mean values as a NumPy array, convert that array to a list of lists, and assign it to column C.

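s = df.set_index(['A', 'B'])
# DataFrame.mean(level=...) was removed in pandas 2.0; group on the index levels instead
out = pd.DataFrame(list(s['C']), s.index).groupby(level=[0, 1]).mean()
# Drop the expanded numeric columns and store each row of means as a list in C
out.drop(columns=out.columns).assign(C=out.values.tolist()).reset_index()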

Approach 2

A naive approach, which can be slower when dealing with big dataframes. Here we group the dataframe by columns A and B and apply a lambda function to column C; the lambda builds a NumPy array from the lists in each group and takes the mean along axis=0.

out = df.groupby(['A', 'B'])['C'].apply(
         lambda s: np.array(list(s)).mean(axis=0)).reset_index()

Result

   A          B                 C
0  1      apple   [4.0, 7.5, 2.0]
1  1     banana  [5.0, 37.0, 1.0]
2  2  pineapple  [2.5, 12.0, 1.5]
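To confirm that the two approaches agree, they can be run side by side on the example frame. The following is a self-contained sketch that assumes a recent pandas release (it uses groupby(level=...) rather than mean(level=...)); the names wide, out1, out2 and same are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

# Approach 1: expand the lists into columns, then average per index level
s = df.set_index(["A", "B"])
wide = pd.DataFrame(list(s["C"]), index=s.index).groupby(level=[0, 1]).mean()
out1 = wide.drop(columns=wide.columns).assign(C=wide.values.tolist()).reset_index()

# Approach 2: per-group lambda that stacks the lists and averages along axis 0
out2 = (df.groupby(["A", "B"])["C"]
          .apply(lambda g: np.array(list(g)).mean(axis=0))
          .reset_index())

# Approach 2 stores NumPy arrays, so convert each one to a list before comparing
same = all(a == list(b) for a, b in zip(out1["C"], out2["C"]))
print(same)
```

Note that approach 1 leaves Python lists in column C, while approach 2 leaves NumPy arrays; both print the same way inside a DataFrame.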

Performance Profiling

On a sample dataframe with 50000 rows and 30000 unique groups:

df = pd.concat([df.assign(B=df['B'] + str(i))
               for i in range(10000)], ignore_index=True)


%%timeit
s = df.set_index(['A', 'B'])
out = pd.DataFrame(list(s['C']), s.index).groupby(level=[0, 1]).mean()
_ = out.drop(columns=out.columns).assign(C=out.values.tolist()).reset_index()
# 173 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%%timeit
_ = df.groupby(['A', 'B'])['C'].apply(lambda s: np.array(list(s)).mean(axis=0)).reset_index()
# 2.24 s ± 68.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
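The %%timeit cells above only run inside IPython or Jupyter. Outside of those, a comparable measurement can be sketched with the standard timeit module. The helper names approach_1 and approach_2 and the reduced size n_copies are assumptions made for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                     "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                     "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

# Assumption: far fewer copies than the 10000 used above, to keep the run short
n_copies = 200
big = pd.concat([base.assign(B=base["B"] + str(i)) for i in range(n_copies)],
                ignore_index=True)

def approach_1(frame):
    # Expand the lists into columns and average over the index levels
    s = frame.set_index(["A", "B"])
    wide = pd.DataFrame(list(s["C"]), index=s.index).groupby(level=[0, 1]).mean()
    return wide.drop(columns=wide.columns).assign(C=wide.values.tolist()).reset_index()

def approach_2(frame):
    # Call a Python lambda once per group
    return (frame.groupby(["A", "B"])["C"]
                 .apply(lambda g: np.array(list(g)).mean(axis=0))
                 .reset_index())

# Average wall-clock time per call over a few repetitions
repeats = 3
t1 = timeit.timeit(lambda: approach_1(big), number=repeats) / repeats
t2 = timeit.timeit(lambda: approach_2(big), number=repeats) / repeats
print(f"approach 1: {t1:.4f} s per call")
print(f"approach 2: {t2:.4f} s per call")
```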

Try:

df = pd.concat([df.pop('C').apply(pd.Series), df], axis=1).groupby(
    ['A', 'B']).mean().apply(list, axis=1).reset_index(name='C')

or:

df = df.T.apply(pd.Series.explode).T.convert_dtypes().groupby(
    ['A', 'B']).mean().apply(list, axis=1).reset_index(name='C')

Try this:

df = df.groupby(['A','B'])['C'].agg(list).reset_index()
df['C'] = df['C'].apply(lambda x: np.mean(x, axis=0))

Output

    A   B         C
0   1   apple     [4.0, 7.5, 2.0]
1   1   banana    [5.0, 37.0, 1.0]
2   2   pineapple [2.5, 12.0, 1.5]
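With np.mean, each entry of column C is a NumPy array rather than a Python list. If plain lists are needed (for example before serializing to JSON), one option is to call .tolist() on each result. A minimal sketch, using the illustrative name out instead of overwriting df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

# Collect the lists of each group, then average element-wise and convert to a list
out = df.groupby(["A", "B"])["C"].agg(list).reset_index()
out["C"] = out["C"].apply(lambda x: np.mean(x, axis=0).tolist())
print(type(out.loc[0, "C"]))  # <class 'list'>
```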

