How to perform a groupby operation on a pandas DataFrame where the average over a list column is taken?
I have a pandas DataFrame like below:
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})
Now I want to perform a groupby operation on columns A and B, and get the average of the lists in column C. The average of multiple lists is defined as the element-wise average: the average of all elements in the first position of the lists, the average of all elements in the second position, and so on.
The desired output for this example looks like this:
A B C
1 apple [4, 7.5, 2]
1 banana [5, 37, 1]
2 pineapple [2.5, 12, 1.5]
(It is always guaranteed that the lists for each group have the same length.)
How to solve this?
Usually I know how to perform groupby operations, either as list aggregations or as averages, but I could not find how to do this when combining multiple lists. Should a groupby operation not be the most efficient solution, I'm also open to other suggestions.
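To make the element-wise definition concrete, here is a minimal sketch that averages the two lists of the (1, apple) group by hand with NumPy. The variable names `apple_lists` and `apple_mean` are only illustrative.

```python
import numpy as np

# The two lists belonging to the group (A=1, B="apple") in the example.
apple_lists = [[6, 5, 2], [2, 10, 2]]

# Stack into a 2-D array (one row per list) and average down the rows,
# which gives one mean per list position.
apple_mean = np.array(apple_lists).mean(axis=0).tolist()
print(apple_mean)  # [4.0, 7.5, 2.0]
```

This matches the first row of the desired output.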
Here, we create a new dataframe from the lists contained in column C and set the index of this newly created dataframe to columns A and B. Now, aggregate this frame by taking the mean over the levels present in the index. Then, using .values + tolist, take the mean values as a NumPy array, convert that array to a list of lists, and assign it to column C.
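s = df.set_index(['A', 'B'])
# groupby(level=...).mean() replaces mean(level=...), which was removed in pandas 2.0
out = pd.DataFrame(list(s['C']), index=s.index).groupby(level=[0, 1]).mean()
# drop the per-position columns and keep a single list column C
out.drop(columns=out.columns).assign(C=out.values.tolist()).reset_index()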
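To inspect the intermediate frames this approach builds, the steps can be sketched one at a time as below. This is only a sketch: it assumes a recent pandas, so it uses `groupby(level=...).mean()` rather than passing `level` to `mean`, and the names `wide`, `means` and `result` are mine, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

s = df.set_index(["A", "B"])

# Step 1: one column per list position, indexed by (A, B).
wide = pd.DataFrame(list(s["C"]), index=s.index)

# Step 2: average the rows that share the same (A, B) index values.
means = wide.groupby(level=["A", "B"]).mean()

# Step 3: pack each row of means back into a single list column C.
result = pd.DataFrame({"C": means.to_numpy().tolist()},
                      index=means.index).reset_index()
print(result)
```

Printing `wide` and `means` along the way shows that the averaging itself is an ordinary numeric groupby; only the first and last steps deal with lists.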
A naive approach, which can be slower when dealing with big dataframes. Here we group the dataframe by columns A and B and apply a lambda function on column C; the lambda function creates a NumPy array from the lists in each group and takes the mean along axis=0.
out = df.groupby(['A', 'B'])['C'].apply(
lambda s: np.array(list(s)).mean(axis=0)).reset_index()
A B C
0 1 apple [4.0, 7.5, 2.0]
1 1 banana [5.0, 37.0, 1.0]
2 2 pineapple [2.5, 12.0, 1.5]
On a sample dataframe with 50000 rows and 30000 unique groups:
df = pd.concat([df.assign(B=df['B'] + str(i))
for i in range(10000)], ignore_index=True)
%%timeit
s = df.set_index(['A', 'B'])
out = pd.DataFrame(list(s['C']), index=s.index).groupby(level=[0, 1]).mean()
_ = out.drop(columns=out.columns).assign(C=out.values.tolist()).reset_index()
# 173 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# (timing as reported for the original mean(level=[0, 1]) form, which pandas 2.0 removed)
%%timeit
_ = df.groupby(['A', 'B'])['C'].apply(lambda s: np.array(list(s)).mean(axis=0)).reset_index()
# 2.24 s ± 68.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
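`%%timeit` is an IPython magic, so the cells above only run in a notebook or IPython shell. As a rough sketch for a plain Python script, the same comparison could be made with `time.perf_counter`. The function names, the `base` and `big` variables, and the smaller `n_copies` (chosen so the sketch finishes quickly) are my own assumptions, and the numbers it prints will differ from the timings quoted above.

```python
import time

import numpy as np
import pandas as pd

base = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                     "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                     "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

# Fewer copies than the 10000 used above, to keep the sketch fast.
n_copies = 200
big = pd.concat([base.assign(B=base["B"] + str(i)) for i in range(n_copies)],
                ignore_index=True)


def wide_frame_mean(frame):
    """First approach: expand the lists into columns, then group on the index."""
    s = frame.set_index(["A", "B"])
    out = pd.DataFrame(list(s["C"]), index=s.index).groupby(level=[0, 1]).mean()
    return out.drop(columns=out.columns).assign(C=out.to_numpy().tolist()).reset_index()


def groupby_apply_mean(frame):
    """Second approach: groupby + apply with a NumPy mean per group."""
    return frame.groupby(["A", "B"])["C"].apply(
        lambda s: np.array(list(s)).mean(axis=0)).reset_index()


for func in (wide_frame_mean, groupby_apply_mean):
    start = time.perf_counter()
    func(big)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {elapsed:.3f} s")
```

A single `perf_counter` measurement is noisier than `%%timeit`, which repeats the run several times, so treat the output as indicative only.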
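Try:
# axis must be passed by keyword to pd.concat in pandas 2.x; name='C' labels the list column
df = pd.concat([df.pop('C').apply(pd.Series), df], axis=1).groupby(
    ['A', 'B']).mean().apply(list, axis=1).reset_index(name='C')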
or:
df = df.T.apply(pd.Series.explode).T.convert_dtypes().groupby(
    ['A', 'B']).mean().apply(list, axis=1).reset_index(name='C')
Try this:
df = df.groupby(['A','B'])['C'].agg(list).reset_index()
df['C'] = df['C'].apply(lambda x: np.mean(x, axis=0))
Output:
A B C
0 1 apple [4.0, 7.5, 2.0]
1 1 banana [5.0, 37.0, 1.0]
2 2 pineapple [2.5, 12.0, 1.5]