Pandas: change cell values based on condition
I have the following Pandas dataframe.
import pandas as pd
data = {'id_a': [1, 1, 1, 2, 2, 2, 3, 4], 'name_a': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'd'],
'id_b': [5, 6, 7, 8, 9, 10, 11, 11], 'name_b': ['e', 'f', 'g', 'h', 'i', 'j', 'k', 'k'],
'similar': [1, 1, 1, 1, 1, 0, 1, 1], 'metric': [.5, 1, .8, .7, .2, .9, .8, .9]}
df = pd.DataFrame(data)
print(df)
   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        1     0.5
1     1      a     6      f        1     1.0
2     1      a     7      g        1     0.8
3     2      b     8      h        1     0.7
4     2      b     9      i        1     0.2
5     2      b    10      j        0     0.9
6     3      c    11      k        1     0.8
7     4      d    11      k        1     0.9
In this table, the IDs of group A are linked to the IDs of group B (based on the `similar` column).
But I need each unique ID of one group to correspond to only one ID of the other group.
And among the rows that share the same ID within a group, I need to select the row in which the `metric` column is maximum.
For example, I have three rows with `id_a == 2`. Among these three rows, only two have a `similar` value equal to 1. Among these two rows, one has a `metric` value of 0.7 and the other has 0.2.
I keep `similar = 1` only for the row whose `metric` is 0.7 (because it is the maximum), and for the second row I set `similar = 0`.
That is, I need the following dataframe:
output_data = {'id_a': [1, 1, 1, 2, 2, 2, 3, 4], 'name_a': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'd'],
'id_b': [5, 6, 7, 8, 9, 10, 11, 11], 'name_b': ['e', 'f', 'g', 'h', 'i', 'j', 'k', 'k'],
'similar': [0, 1, 0, 1, 0, 0, 0, 1], 'metric': [.5, 1, .8, .7, .2, .9, .8, .9]}
output_df = pd.DataFrame(output_data)
print(output_df)
   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        0     0.5
1     1      a     6      f        1     1.0
2     1      a     7      g        0     0.8
3     2      b     8      h        1     0.7
4     2      b     9      i        0     0.2
5     2      b    10      j        0     0.9
6     3      c    11      k        0     0.8
7     4      d    11      k        1     0.9
Question: How can I implement this in Python? My own research did not turn up any results.
I'm not sure how you're handling the case of `id_a == 3`, for instance, but I think this is what you want. Just take the index of the maximum `metric` from each group (grouped by `id_a`, among rows with `similar == 1`) and then, after resetting the `similar` column to 0, set the rows at those indexes back to 1.
max_vals = df.groupby('id_a').apply(lambda grp: grp.loc[grp['similar'] == 1, 'metric'].idxmax())
df['similar'] = 0
df.loc[max_vals, 'similar'] = 1
>>> df
   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        0     0.5
1     1      a     6      f        1     1.0
2     1      a     7      g        0     0.8
3     2      b     8      h        1     0.7
4     2      b     9      i        0     0.2
5     2      b    10      j        0     0.9
6     3      c    11      k        1     0.8
7     4      d    11      k        1     0.9
EDIT: See the comments as to why the output doesn't match exactly for row #6.
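One possible reading of the row #6 difference, inferred only from the expected output in the question: rows 6 and 7 share `id_b == 11`, so the "one ID per group" rule may be meant to apply to `id_b` as well as `id_a`. A minimal sketch under that assumption follows; the names `candidates`, `best_a`, `best_b`, `keep` and `result` are illustrative and not part of the original answer.

```python
import pandas as pd

data = {'id_a': [1, 1, 1, 2, 2, 2, 3, 4], 'name_a': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'd'],
        'id_b': [5, 6, 7, 8, 9, 10, 11, 11], 'name_b': ['e', 'f', 'g', 'h', 'i', 'j', 'k', 'k'],
        'similar': [1, 1, 1, 1, 1, 0, 1, 1], 'metric': [.5, 1, .8, .7, .2, .9, .8, .9]}
df = pd.DataFrame(data)

# Only rows that are currently marked similar can keep their 1
candidates = df[df['similar'] == 1]

# Index label of the highest-metric candidate per id_a and per id_b
best_a = candidates.groupby('id_a')['metric'].idxmax()
best_b = candidates.groupby('id_b')['metric'].idxmax()

# Assumption: a row keeps similar == 1 only if it is the best row for BOTH ids
keep = df.index.isin(best_a) & df.index.isin(best_b)
result = df.assign(similar=keep.astype(int))
print(result['similar'].tolist())
```

On the sample data this reproduces the `similar` column of the expected output, including 0 for row 6. Note that `idxmax` keeps a single row per group when metrics tie, which may or may not be the desired behaviour.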
IIUC, you could do:
# find the indices of the maximum by id_a
keep_a = df[df.similar.eq(1)].groupby('id_a').filter(lambda x: len(x) > 1).groupby('id_a').metric.idxmax()
# find the indices of the maximum by id_b
keep_b = df[df.similar.eq(1)].groupby('id_b').filter(lambda x: len(x) > 1).groupby('id_b').metric.idxmax()
# create mask False if is in set of maximum
mask = ~df.index.isin(set(keep_a) | set(keep_b))
# set values using mask
df.loc[mask, 'similar'] = 0
print(df)
Output
   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        0     0.5
1     1      a     6      f        1     1.0
2     1      a     7      g        0     0.8
3     2      b     8      h        1     0.7
4     2      b     9      i        0     0.2
5     2      b    10      j        0     0.9
6     3      c    11      k        0     0.8
7     4      d    11      k        1     0.9
Use `Series.mask` to turn the value of `metric` into `NaN` where `similar == 0`, so that it can never be the maximum and therefore never gets a 1 in the result.
Use `DataFrame.shift` + `DataFrame.all` + `Series.cumsum` to build groups of consecutive rows that share a value in either `id_a` or `id_b`. Keep in mind that this would be just as simple for N ids.
Create a series with the per-group maximums using `groupby.transform` and compare it with the `metric` series to obtain a Boolean series, which you can convert to 1 or 0 with `Series.astype`.
df2=df.copy()
#discarding similar == 0 as a maximum candidate in the groups
df2['metric']=df2['metric'].mask(df2['similar'].eq(0))
#creating groups depend on id_a and id_b
ids=df2[['id_a','id_b']]
groups=ids.ne(ids.shift()).all(axis=1).cumsum()
#checking the maximum per group and converting to integer
df['similar']=df['metric'].eq(df2.groupby(groups).metric.transform('max')).astype(int)
print(df)
Output
   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        0     0.5
1     1      a     6      f        1     1.0
2     1      a     7      g        0     0.8
3     2      b     8      h        1     0.7
4     2      b     9      i        0     0.2
5     2      b    10      j        0     0.9
6     3      c    11      k        0     0.8
7     4      d    11      k        1     0.9
Detail of groups
print(groups)
0    1
1    1
2    1
3    2
4    2
5    2
6    3
7    3
dtype: int64
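To see where these group labels come from, the grouping expression can be split into its intermediate steps. This is only an illustrative sketch on the two id columns of the sample data; the names `changed` and `starts_group` are made up for the walkthrough.

```python
import pandas as pd

ids = pd.DataFrame({'id_a': [1, 1, 1, 2, 2, 2, 3, 4],
                    'id_b': [5, 6, 7, 8, 9, 10, 11, 11]})

# True where a value differs from the row above
# (the first row is compared with NaN, so it always counts as changed)
changed = ids.ne(ids.shift())

# A new group starts only when id_a and id_b both change at the same row
starts_group = changed.all(axis=1)

# The running count of group starts is the group label of each row
groups = starts_group.cumsum()
print(groups.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3]
```

Because each row is compared only with the row directly above it, this grouping seems to rely on related rows being adjacent, so it may be worth sorting or checking the row order before applying it to other data.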
Use groupby `idxmax` and `isin` on 2 groupbys within a list comprehension, passing the result to `np.array`. Finally, call `all` and `astype` on the `np.array`.
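import numpy as np
# keep only the candidate rows (similar == 1)
df1 = df[df.similar.eq(1)]
# a row stays 1 only if it holds the maximum metric for both its id_a and its id_b
df['similar'] = np.array([df.index.isin(df1.groupby(col).metric.idxmax())
                          for col in ['id_a', 'id_b']]).all(0).astype(int)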
Out[132]:
   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        0     0.5
1     1      a     6      f        1     1.0
2     1      a     7      g        0     0.8
3     2      b     8      h        1     0.7
4     2      b     9      i        0     0.2
5     2      b    10      j        0     0.9
6     3      c    11      k        0     0.8
7     4      d    11      k        1     0.9
A solution which uses vectorized methods only.
`m1`: vector with the `max` values per group, among rows where `similar == 1`
`m2`: rows where `similar == 1`
`m3`: rows which have the `max` value and `similar == 1`
m1 = df.query('similar == 1').groupby('id_a')['metric'].transform('max')
m2 = df['similar'].eq(1)
m3 = df.loc[m2, 'metric'].eq(m1)
df.loc[m3[~m3].index, 'similar'] = 0
   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        0    0.50
1     1      a     6      f        1    1.00
2     1      a     7      g        0    0.80
3     2      b     8      h        1    0.70
4     2      b     9      i        0    0.20
5     2      b    10      j        0    0.90
6     3      c    11      k        1    0.80
7     4      d    11      k        1    0.90
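This output keeps row 6 at 1, while the expected output in the question has 0 there. If the rule is also meant to hold per `id_b` (an assumption, based on rows 6 and 7 sharing `id_b == 11`), a variant in the same vectorized style could compare each candidate's `metric` with the per-group maximum for both id columns. The names `candidate`, `is_max_a`, `is_max_b` and `result` are illustrative.

```python
import pandas as pd

data = {'id_a': [1, 1, 1, 2, 2, 2, 3, 4], 'name_a': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'd'],
        'id_b': [5, 6, 7, 8, 9, 10, 11, 11], 'name_b': ['e', 'f', 'g', 'h', 'i', 'j', 'k', 'k'],
        'similar': [1, 1, 1, 1, 1, 0, 1, 1], 'metric': [.5, 1, .8, .7, .2, .9, .8, .9]}
df = pd.DataFrame(data)

# Metric of candidate rows only; rows with similar == 0 become NaN
candidate = df['similar'].eq(1)
metric = df['metric'].where(candidate)

# True where a candidate's metric equals the maximum of its group
# (NaN never compares equal, so non-candidates are always False)
is_max_a = metric.eq(metric.groupby(df['id_a']).transform('max'))
is_max_b = metric.eq(metric.groupby(df['id_b']).transform('max'))

# Assumption: a row keeps similar == 1 only if it is the maximum for both ids
result = df.assign(similar=(is_max_a & is_max_b).astype(int))
print(result['similar'].tolist())
```

Unlike an `idxmax`-based approach, comparing against the group maximum keeps every row that ties for the maximum, so which variant fits depends on how ties should be handled.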