How to assign a value to each group by a groupby query on another dataframe?
Consider the following DataFrames:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
"k1": [1, 1, 2, 2, 3, 3, 4, 4, 4],
})
df2 = pd.DataFrame({
"k2": [1, 1, 2, 2, 3, 4, 4],
"v2": np.random.rand(7)
})
print(df1)
print("_______")
print(df2)
print("_______")
Output:
k1
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
8 4
_______
k2 v2
0 1 0.260026
1 1 0.474951
2 2 0.695962
3 2 0.158575
4 3 0.396015
5 4 0.740344
6 4 0.293410
_______
I want to create a new column in df1 such that every row gets a value computed from df2: for a row with key k1, take the group in df2 whose key k2 equals k1 and apply a function (say max) to the v2 values of that group.
Required output for the above case:
k1 result
0 1 0.474951
1 1 0.474951
2 2 0.695962
3 2 0.695962
4 3 0.396015
5 3 0.396015
6 4 0.740344
7 4 0.740344
8 4 0.740344
It can be assumed that all keys present in k1 are also present in k2.
This can probably be done with two groupby operations, one for the query and one for the assignment, but I can't figure out how to tie the output of one to the input of the other.
Edit:
Please note that k1 and k2 are sorted in the example for clarity, but they are not guaranteed to be. I also don't want to sort, because sorting takes O(n log n) time and this can be done in O(n).
We can try map and groupby:
df1['result'] = df1['k1'].map(df2.groupby('k2')['v2'].max())
k1 result
0 1 0.474951
1 1 0.474951
2 2 0.695962
3 2 0.695962
4 3 0.396015
5 3 0.396015
6 4 0.740344
7 4 0.740344
8 4 0.740344
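As a self-contained check of the one-liner above, here is a minimal sketch that hard-codes the v2 values printed in the question (instead of calling np.random.rand) so the result is reproducible. The name group_max is only illustrative. Passing sort=False to groupby is an optional addition, not part of the original answer: it tells pandas not to sort the group keys, which seems to fit the requirement of avoiding a sort, though the exact complexity depends on pandas internals.

```python
import pandas as pd

df1 = pd.DataFrame({"k1": [1, 1, 2, 2, 3, 3, 4, 4, 4]})
df2 = pd.DataFrame({
    "k2": [1, 1, 2, 2, 3, 4, 4],
    # Fixed copies of the random values shown in the question's output.
    "v2": [0.260026, 0.474951, 0.695962, 0.158575, 0.396015, 0.740344, 0.293410],
})

# One aggregated value per key in df2; the result is a Series indexed by k2.
# sort=False (an assumption added here) skips sorting the group keys.
group_max = df2.groupby("k2", sort=False)["v2"].max()

# Look up each k1 in the index of group_max and assign the matching value.
df1["result"] = df1["k1"].map(group_max)
print(df1)
```

Because map performs a lookup by key, neither k1 nor k2 has to be sorted for this to work. Any other aggregation (min, mean, and so on) can replace max in the same way.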
First, sort df2 on the k2 and v2 columns, with v2 in descending order, so that the largest v2 value of each group comes first. Then drop duplicates on k2, keeping the first row of each group, which holds the max. Finally, set k2 as the index and map the remaining v2 column onto the k1 column of df1.
df1['result'] = df1['k1'].map(df2.sort_values(['k2', 'v2'], ascending=[True, False]).drop_duplicates('k2', keep='first').set_index('k2')['v2'])
print(df1)
k1 result
0 1 0.303764
1 1 0.303764
2 2 0.026024
3 2 0.026024
4 3 0.213834
5 3 0.213834
6 4 0.757031
7 4 0.757031
8 4 0.757031
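The numbers printed above differ from the required output in the question, presumably because v2 is generated with np.random.rand and this is a different run. The sketch below uses made-up fixed values and deliberately unsorted keys (both are assumptions for illustration) to check that the sort-and-drop-duplicates approach gives the same result as a groupby max. Note that this approach does sort df2, so it does not avoid the O(n log n) cost the question mentions.

```python
import pandas as pd

# Hypothetical unsorted sample data, chosen only for this illustration.
df1 = pd.DataFrame({"k1": [4, 1, 3, 2, 4, 1]})
df2 = pd.DataFrame({
    "k2": [4, 2, 1, 4, 3, 1, 2],
    "v2": [0.29, 0.70, 0.26, 0.74, 0.40, 0.47, 0.16],
})

# Sort so the largest v2 of each k2 comes first, keep that row per key,
# then index by k2 to get a key -> max lookup Series.
lookup = (
    df2.sort_values(["k2", "v2"], ascending=[True, False])
       .drop_duplicates("k2", keep="first")
       .set_index("k2")["v2"]
)
via_sort = df1["k1"].map(lookup)

# Same lookup built with groupby, for comparison.
via_groupby = df1["k1"].map(df2.groupby("k2")["v2"].max())

same = via_sort.equals(via_groupby)
print(same)
print(via_sort.tolist())
```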