How can I assign a value to each group using a groupby query on another DataFrame?

Question

Consider the following DataFrames:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "k1": [1, 1, 2, 2, 3, 3, 4, 4, 4],
})

df2 = pd.DataFrame({
    "k2": [1, 1, 2, 2, 3, 4, 4],
    "v2": np.random.rand(7)
})

print(df1)
print("_______")
print(df2)
print("_______")

Output:

   k1
0   1
1   1
2   2
3   2
4   3
5   3
6   4
7   4
8   4
_______
   k2        v2
0   1  0.260026
1   1  0.474951
2   2  0.695962
3   2  0.158575
4   3  0.396015
5   4  0.740344
6   4  0.293410
_______

I want to add a new column to df1 such that each row gets a value computed from df2: for a row with key k1, the value is a function (say, max) of v2 over the group in df2 whose key k2 equals k1.

Required output for the above case:

   k1  result
0   1  0.474951
1   1  0.474951
2   2  0.695962
3   2  0.695962
4   3  0.396015
5   3  0.396015
6   4  0.740344
7   4  0.740344
8   4  0.740344

It can be assumed that all keys present in k1 are also present in k2.

This is probably done with two groupby operations, one for the query and one for the assignment, but I can't figure out how to tie the output of one to the input of the other.

Edit:
Note that k1 and k2 are sorted in the example for clarity, but this is not guaranteed. I also don't want to sort, since sorting takes O(n log n) time and this can be done in O(n).

Answer 1

We can try map and groupby:

df1['result'] = df1['k1'].map(df2.groupby('k2')['v2'].max())

   k1    result
0   1  0.474951
1   1  0.474951
2   2  0.695962
3   2  0.695962
4   3  0.396015
5   3  0.396015
6   4  0.740344
7   4  0.740344
8   4  0.740344
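
As a self-contained sketch of the same idea, the snippet below uses deliberately unsorted keys and fixed v2 values copied from the question's printout (the key order is made up for illustration). It assumes that passing sort=False to groupby is acceptable here, since the lookup by key does not depend on the order of the groups.

```python
import pandas as pd

# Keys deliberately unsorted; v2 values copied from the question's printout.
df1 = pd.DataFrame({"k1": [4, 1, 3, 2, 4, 1, 2, 3, 4]})
df2 = pd.DataFrame({
    "k2": [4, 2, 1, 3, 1, 4, 2],
    "v2": [0.740344, 0.695962, 0.260026, 0.396015, 0.474951, 0.293410, 0.158575],
})

# Pass over df2: per-key maximum, returned as a Series indexed by k2.
# sort=False skips sorting the group keys, which the lookup does not need.
lookup = df2.groupby("k2", sort=False)["v2"].max()

# Pass over df1: map each k1 to its group's maximum via the Series index.
df1["result"] = df1["k1"].map(lookup)
print(df1)
```

The Series returned by groupby acts as a key-to-value lookup table, so map ties the "query" step to the "assignment" step without a second groupby on df1.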

Answer 2

First, sort df2 on the k2 and v2 columns (k2 ascending, v2 descending) so that the largest v2 of each key comes first. Then drop duplicates on k2, keeping the first row, which holds the maximum. Finally, index the result by k2 and map its v2 column onto the k1 column of df1. The output below appears to come from a fresh random draw of v2, so its numbers differ from those in the question.

df1['result'] = df1['k1'].map(df2.sort_values(['k2', 'v2'], ascending=[True, False]).drop_duplicates('k2', keep='first').set_index('k2')['v2'])

print(df1)

   k1    result
0   1  0.303764
1   1  0.303764
2   2  0.026024
3   2  0.026024
4   3  0.213834
5   3  0.213834
6   4  0.757031
7   4  0.757031
8   4  0.757031
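
A quick way to check that the sort-based lookup matches the groupby-based one is to build both on the same data and compare. This is only a sketch: the fixed seed and the names by_sort, by_groupby and same are assumptions added for reproducibility and illustration, not part of either answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (assumption) so reruns agree
df1 = pd.DataFrame({"k1": [1, 1, 2, 2, 3, 3, 4, 4, 4]})
df2 = pd.DataFrame({"k2": [1, 1, 2, 2, 3, 4, 4], "v2": rng.random(7)})

# Sort-based lookup: largest v2 first within each key, keep that row per key.
by_sort = (
    df2.sort_values(["k2", "v2"], ascending=[True, False])
       .drop_duplicates("k2", keep="first")
       .set_index("k2")["v2"]
)

# Groupby-based lookup: per-key maximum of v2.
by_groupby = df2.groupby("k2")["v2"].max()

# Map both lookups onto k1 and compare the resulting columns.
same = df1["k1"].map(by_sort).equals(df1["k1"].map(by_groupby))
print(same)
```

Both lookups are expected to agree. The sort-based version does sort df2, though, which is the O(n log n) step the question's edit wanted to avoid.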

Answer 1: score 3, accepted, 2021-04-27 16:10:07

Answer 2: score 1, 2021-04-27 16:09:38
