Pandas: change cell values based on a condition

Question

I have the following Pandas dataframe.

import pandas as pd

data = {'id_a': [1, 1, 1, 2, 2, 2, 3, 4], 'name_a': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'd'], 
        'id_b': [5, 6, 7, 8, 9, 10, 11, 11], 'name_b': ['e', 'f', 'g', 'h', 'i', 'j', 'k', 'k'], 
        'similar': [1, 1, 1, 1, 1, 0, 1, 1], 'metric': [.5, 1, .8, .7, .2, .9, .8, .9]}
df = pd.DataFrame(data)
print(df)

      id_a   name_a   id_b   name_b   similar   metric  
 --- ------ -------- ------ -------- --------- -------- 
  0    1       a       5       e         1       0.5    
  1    1       a       6       f         1       1.0    
  2    1       a       7       g         1       0.8    
  3    2       b       8       h         1       0.7    
  4    2       b       9       i         1       0.2    
  5    2       b       10      j         0       0.9    
  6    3       c       11      k         1       0.8    
  7    4       d       11      k         1       0.9

In this table, the IDs of group A are linked to the IDs of group B (based on the column similar).

But I need each unique ID in one group to correspond to only one ID in the other group.

Among the rows that share an ID within a group, I need to select the row in which the column metric is largest.

For example, I have three rows with id_a == 2. Among these three rows, only two have similar equal to 1. Of these two rows, one has a metric value of 0.7 and the other has 0.2.

I keep similar = 1 only for the row with a metric of 0.7 (because it is the maximum), and for the second row I set similar = 0.

That is, I need the following dataframe:

output_data = {'id_a': [1, 1, 1, 2, 2, 2, 3, 4], 'name_a': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'd'], 
               'id_b': [5, 6, 7, 8, 9, 10, 11, 11], 'name_b': ['e', 'f', 'g', 'h', 'i', 'j', 'k', 'k'], 
               'similar': [0, 1, 0, 1, 0, 0, 0, 1], 'metric': [.5, 1, .8, .7, .2, .9, .8, .9]}
output_df = pd.DataFrame(output_data)
print(output_df)

      id_a   name_a   id_b   name_b   similar   metric  
 --- ------ -------- ------ -------- --------- -------- 
  0    1       a       5       e         0       0.5    
  1    1       a       6       f         1       1.0    
  2    1       a       7       g         0       0.8    
  3    2       b       8       h         1       0.7    
  4    2       b       9       i         0       0.2    
  5    2       b       10      j         0       0.9    
  6    3       c       11      k         0       0.8    
  7    4       d       11      k         1       0.9

Question: How can I implement this in Python? My own research did not turn up anything.
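To make the requirement concrete, here is a minimal sketch of a check for the "one ID links to at most one ID" rule. The helper name is_one_to_one and the frame names before and after are made up for illustration; only the id and similar columns from the question are used.

```python
import pandas as pd

ids = {'id_a': [1, 1, 1, 2, 2, 2, 3, 4],
       'id_b': [5, 6, 7, 8, 9, 10, 11, 11]}
before = pd.DataFrame({**ids, 'similar': [1, 1, 1, 1, 1, 0, 1, 1]})
after = pd.DataFrame({**ids, 'similar': [0, 1, 0, 1, 0, 0, 0, 1]})

def is_one_to_one(frame):
    """True if no id_a and no id_b has more than one row with similar == 1."""
    linked = frame[frame['similar'].eq(1)]
    return bool(linked['id_a'].is_unique and linked['id_b'].is_unique)

print(is_one_to_one(before))  # False: id_a 1 and 2, and id_b 11, are linked more than once
print(is_one_to_one(after))   # True
```

This only checks uniqueness; it does not verify that the surviving row is the one with the largest metric.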

Answer 1

I'm not sure how you're handling the case of id_a == 3, for instance, but I think this is what you want. Just take the index of the maximum metric from each group (grouped by id_a) and then, after resetting the similar column to 0, set the rows at those indexes back to 1.

max_vals = df.groupby('id_a').apply(lambda grp: grp.loc[grp['similar'] == 1, 'metric'].idxmax())
df['similar'] = 0
df.loc[max_vals, 'similar'] = 1

>>> df

    id_a    name_a  id_b    name_b  similar metric
0   1       a       5       e       0       0.5
1   1       a       6       f       1       1.0
2   1       a       7       g       0       0.8
3   2       b       8       h       1       0.7
4   2       b       9       i       0       0.2
5   2       b       10      j       0       0.9
6   3       c       11      k       1       0.8
7   4       d       11      k       1       0.9

EDIT: See the comments as to why the output doesn't match exactly for row #6.
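A possible variant of the same idea, as a sketch rather than a replacement: filter to similar == 1 first and then call idxmax on the groupby directly. As far as I know, the apply version hands an empty Series to idxmax whenever an id_a has no similar == 1 rows at all, which raises an error; filtering first simply leaves such groups out. It still groups by id_a only, so row 6 keeps its 1 here as well.

```python
import pandas as pd

df = pd.DataFrame({'id_a': [1, 1, 1, 2, 2, 2, 3, 4],
                   'id_b': [5, 6, 7, 8, 9, 10, 11, 11],
                   'similar': [1, 1, 1, 1, 1, 0, 1, 1],
                   'metric': [.5, 1, .8, .7, .2, .9, .8, .9]})

# only rows that are currently linked can win their group
linked = df[df['similar'].eq(1)]
# index label of the largest metric within each id_a
max_idx = linked.groupby('id_a')['metric'].idxmax()

df['similar'] = 0
df.loc[max_idx, 'similar'] = 1
print(df['similar'].tolist())  # [0, 1, 0, 1, 0, 0, 1, 1]
```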

Answer 2

IIUC, you could do:

# find the indices of the maximum by id_a
keep_a = df[df.similar.eq(1)].groupby('id_a').filter(lambda x: len(x) > 1).groupby('id_a').metric.idxmax()

# find the indices of the maximum by id_b
keep_b = df[df.similar.eq(1)].groupby('id_b').filter(lambda x: len(x) > 1).groupby('id_b').metric.idxmax()

# create mask False if is in set of maximum
mask = ~df.index.isin(set(keep_a) | set(keep_b))

# set values using mask
df.loc[mask, 'similar'] = 0

print(df)

Output

   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        0     0.5
1     1      a     6      f        1     1.0
2     1      a     7      g        0     0.8
3     2      b     8      h        1     0.7
4     2      b     9      i        0     0.2
5     2      b    10      j        0     0.9
6     3      c    11      k        0     0.8
7     4      d    11      k        1     0.9
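One thing worth checking before reusing this on other data: because of the len(x) > 1 filter, a row whose id_a and id_b each occur only once never lands in keep_a or keep_b, so the mask sets it to 0 even though it conflicts with nothing. The question's data has no such row, so the output above is unaffected. A small sketch with a made-up frame (toy is hypothetical, not from the question) shows the effect:

```python
import pandas as pd

# Hypothetical data: the last row's id_a (9) and id_b (99) each occur only once.
toy = pd.DataFrame({'id_a': [1, 1, 2, 3, 9],
                    'id_b': [5, 6, 7, 7, 99],
                    'similar': [1, 1, 1, 1, 1],
                    'metric': [.5, 1, .4, .6, .3]})

linked = toy[toy['similar'].eq(1)]
keep_a = linked.groupby('id_a').filter(lambda g: len(g) > 1).groupby('id_a')['metric'].idxmax()
keep_b = linked.groupby('id_b').filter(lambda g: len(g) > 1).groupby('id_b')['metric'].idxmax()

# same masking step as in the answer
toy.loc[~toy.index.isin(set(keep_a) | set(keep_b)), 'similar'] = 0
print(toy['similar'].tolist())  # [0, 1, 0, 1, 0] -> the last row lost its link
```

Whether that is the desired behaviour depends on the data; if unique pairs should stay linked, an approach that intersects the per-column maxima without the length filter avoids it.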

Answer 3

Here is a clear, symmetrical, orderly and fast way to do this task.

Use Series.mask to turn the value of metric into NaN where similar == 0, so that such a row can never be the maximum and therefore never gets a 1 in the result.
Use DataFrame.shift + DataFrame.all + Series.cumsum to group rows whenever there are consecutive equal values in either id_a or id_b. Keep in mind that this stays just as simple for N id columns.
Create a Series with the maximum per group using groupby.transform and compare it with the metric Series to obtain a Boolean Series, which you can convert to 1 or 0 with Series.astype.

df2=df.copy()
#discarding similar == 0 as a maximum candidate in the groups
df2['metric']=df2['metric'].mask(df2['similar'].eq(0))

#creating groups depend on id_a and id_b
ids=df2[['id_a','id_b']]
groups=ids.ne(ids.shift()).all(axis=1).cumsum()

#checking the maximum per group and converting to integer
df['similar']=df['metric'].eq(df2.groupby(groups).metric.transform('max')).astype(int)
print(df)

Output

   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        0     0.5
1     1      a     6      f        1     1.0
2     1      a     7      g        0     0.8
3     2      b     8      h        1     0.7
4     2      b     9      i        0     0.2
5     2      b    10      j        0     0.9
6     3      c    11      k        0     0.8
7     4      d    11      k        1     0.9

Detail of groups

print(groups)
0    1
1    1
2    1
3    2
4    2
5    2
6    3
7    3
dtype: int64
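For anyone unfamiliar with the ne(shift) / cumsum idiom, here is the grouping step on its own, split into named intermediate steps (the names changed and new_group are mine). A new group starts only on rows where both id_a and id_b differ from the row above, so the labels depend on row order and assume that related rows sit next to each other, as they do in the question's data.

```python
import pandas as pd

ids = pd.DataFrame({'id_a': [1, 1, 1, 2, 2, 2, 3, 4],
                    'id_b': [5, 6, 7, 8, 9, 10, 11, 11]})

# True where a column differs from the row above (first row compares against NaN)
changed = ids.ne(ids.shift())
# a new group starts only when BOTH ids change at once
new_group = changed.all(axis=1)
# running count of group starts gives one label per block of rows
groups = new_group.cumsum()
print(groups.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3]
```

Rows 6 and 7 end up in the same group because they share id_b == 11, even though id_a differs.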

Answer 4

Use groupby idxmax and isin for each of the two groupbys inside a list comprehension, and pass the result to np.array. Finally, call all and astype on the array.

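import numpy as np

df1 = df[df.similar.eq(1)]
# a row keeps similar == 1 only if it is the maximum of both its id_a and its id_b group
df['similar'] = np.array([df.index.isin(df1.groupby(col).metric.idxmax())
                            for col in ['id_a', 'id_b']]).all(0).astype(int)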


Out[132]:
   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        0     0.5
1     1      a     6      f        1     1.0
2     1      a     7      g        0     0.8
3     2      b     8      h        1     0.7
4     2      b     9      i        0     0.2
5     2      b    10      j        0     0.9
6     3      c    11      k        0     0.8
7     4      d    11      k        1     0.9
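The same logic unrolled into named steps, in case the list comprehension is hard to read. This is only a sketch of how I read the answer; the names linked, best_in_a and best_in_b are mine.

```python
import pandas as pd

df = pd.DataFrame({'id_a': [1, 1, 1, 2, 2, 2, 3, 4],
                   'id_b': [5, 6, 7, 8, 9, 10, 11, 11],
                   'similar': [1, 1, 1, 1, 1, 0, 1, 1],
                   'metric': [.5, 1, .8, .7, .2, .9, .8, .9]})

linked = df[df['similar'].eq(1)]
# True for rows holding the largest metric within their id_a / id_b
best_in_a = df.index.isin(linked.groupby('id_a')['metric'].idxmax())
best_in_b = df.index.isin(linked.groupby('id_b')['metric'].idxmax())

# a row survives only if it wins in both groupings
df['similar'] = (best_in_a & best_in_b).astype(int)
print(df['similar'].tolist())  # [0, 1, 0, 1, 0, 0, 0, 1]
```

Row 6 wins its id_a group (it is the only row with id_a == 3) but loses id_b == 11 to row 7, which is why it ends up as 0.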

Answer 5

A solution which uses vectorized methods only.

m1: the maximum metric per id_a group, computed over the rows with similar == 1
m2: mask of the rows where similar == 1
m3: for the rows with similar == 1, whether the row holds its group's maximum

m1 = df.query('similar == 1').groupby('id_a')['metric'].transform('max')
m2 = df['similar'].eq(1)
m3 = df.loc[m2, 'metric'].eq(m1)

df.loc[m3[~m3].index, 'similar'] = 0

   id_a name_a  id_b name_b  similar  metric
0     1      a     5      e        0    0.50
1     1      a     6      f        1    1.00
2     1      a     7      g        0    0.80
3     2      b     8      h        1    0.70
4     2      b     9      i        0    0.20
5     2      b    10      j        0    0.90
6     3      c    11      k        1    0.80
7     4      d    11      k        1    0.90
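Row 6 keeps similar = 1 in the output above because only id_a is grouped. If id_b should be taken into account as well, one way to extend the transform idea is sketched below. This is my own variation under that assumption, not part of the answer; the names metric, is_max_a and is_max_b are made up.

```python
import pandas as pd

df = pd.DataFrame({'id_a': [1, 1, 1, 2, 2, 2, 3, 4],
                   'id_b': [5, 6, 7, 8, 9, 10, 11, 11],
                   'similar': [1, 1, 1, 1, 1, 0, 1, 1],
                   'metric': [.5, 1, .8, .7, .2, .9, .8, .9]})

# NaN out the metric of unlinked rows so they can never equal a group maximum
metric = df['metric'].where(df['similar'].eq(1))

# compare each row with the maximum of its id_a group and of its id_b group
is_max_a = metric.eq(metric.groupby(df['id_a']).transform('max'))
is_max_b = metric.eq(metric.groupby(df['id_b']).transform('max'))

df['similar'] = (is_max_a & is_max_b).astype(int)
print(df['similar'].tolist())  # [0, 1, 0, 1, 0, 0, 0, 1]
```

Note that comparing against the maximum keeps every row that ties for it, whereas the idxmax-based answers keep a single row per group.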

5 answers

Answer 1
2 2019-11-19 20:44:47

Answer 2
2 2019-11-19 21:04:47

Answer 3
2 2019-11-19 21:05:27

Answer 4
2 accepted 2019-11-19 21:07:30

Answer 5
1 2019-11-19 20:51:58
