Find the minimum value in a Pandas dataframe and add a label in a new column

Question

What improvements can I make to my Python pandas code to make it more efficient? For my case, I have this dataframe:

In [1]: df = pd.DataFrame({'PersonID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                           'Name': ["Jan", "Jan", "Jan", "Don", "Don", "Don", "Joe", "Joe", "Joe"],
                           'Label': ["REL", "REL", "REL", "REL", "REL", "REL", "REL", "REL", "REL"],
                           'RuleID': [55, 55, 55, 3, 3, 3, 10, 10, 10],
                           'RuleNumber': [3, 4, 5, 1, 2, 3, 234, 567, 999]})

Which gives this result:

In [2]: df
Out[2]: 
   PersonID Name Label  RuleID  RuleNumber
0         1  Jan   REL      55          3
1         1  Jan   REL      55          4
2         1  Jan   REL      55          5
3         2  Don   REL       3          1
4         2  Don   REL       3          2
5         2  Don   REL       3          3
6         3  Joe   REL      10        234
7         3  Joe   REL      10        567
8         3  Joe   REL      10        999

What I need to accomplish here is to update the Label column to MAIN for the row with the lowest RuleNumber value associated with each RuleID that is applied to a PersonID and Name. Therefore, the results need to look like this:

In [3]: df
Out[3]:
   PersonID Name Label  RuleID  RuleNumber
0         1  Jan  MAIN      55           3
1         1  Jan   REL      55           4
2         1  Jan   REL      55           5
3         2  Don  MAIN       3           1
4         2  Don   REL       3           2
5         2  Don   REL       3           3
6         3  Joe  MAIN      10         234
7         3  Joe   REL      10         567
8         3  Joe   REL      10         999

This is the code that I wrote to accomplish this:

In [4]:

df['Label'] = np.where(
        df['RuleNumber'] ==
        df.groupby(['PersonID', 'Name', 'RuleID'])['RuleNumber'].transform('min'),
        "MAIN", df.Label)

Is there a better way to update the values under the Label column? I feel like I'm brute-forcing my way through, and this may not be the most efficient way to do this.

I used the following SO threads and documentation pages to arrive at my result:

Replace column values within a groupby and condition

Replace values within a groupby based on multiple conditions

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmin.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html

Using Pandas to Find Minimum Values of Grouped Rows

Any advice would be appreciated.

Thank you.

Answer 1

It seems like you can select the rows given by the grouped idxmin, regardless of sort order, and update Label based on that. You can use loc, np.where, mask, or where as follows:

df.loc[df.groupby(['PersonID', 'Name', 'RuleID'])['RuleNumber'].idxmin(), 'Label'] = 'MAIN'
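To illustrate the "regardless of sort order" point, here is a minimal self-contained sketch. It assumes a shuffled copy of the question's data (the row order within each person is hypothetical, not from the question) and the variable name min_rows is introduced only for this example.

```python
import pandas as pd

# Same values as the question's dataframe, but the RuleNumber values
# inside each person are deliberately out of order (hypothetical data).
df = pd.DataFrame({
    'PersonID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Name': ["Jan", "Jan", "Jan", "Don", "Don", "Don", "Joe", "Joe", "Joe"],
    'Label': ["REL"] * 9,
    'RuleID': [55, 55, 55, 3, 3, 3, 10, 10, 10],
    'RuleNumber': [5, 3, 4, 2, 3, 1, 567, 999, 234],
})

# idxmin returns, for each group, the index label of the row that holds
# the smallest RuleNumber, so no prior sorting is needed.
min_rows = df.groupby(['PersonID', 'Name', 'RuleID'])['RuleNumber'].idxmin()

# Use those index labels to overwrite Label only on the minimum rows.
df.loc[min_rows, 'Label'] = 'MAIN'

# RuleNumber values of the rows that ended up labelled MAIN.
main_numbers = df.loc[df['Label'] == 'MAIN', 'RuleNumber'].tolist()
print(df)
```

Only one row per group is relabelled, and the other rows keep whatever Label they already had.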

Or with np.where, as you were trying:

df['Label'] = (np.where((df.index == df.groupby(['PersonID', 'Name', 'RuleID'])
                         ['RuleNumber'].transform('idxmin')), 'MAIN', 'REL'))
df
Out[1]: 
   PersonID Name Label  RuleID  RuleNumber
0         1  Jan  MAIN      55           3
1         1  Jan   REL      55           4
2         1  Jan   REL      55           5
3         2  Don  MAIN       3           1
4         2  Don   REL       3           2
5         2  Don   REL       3           3
6         3  Joe  MAIN      10         234
7         3  Joe   REL      10         567
8         3  Joe   REL      10         999

Using mask or its inverse where would also work:

df['Label'] = (df['Label'].mask((df.index == df.groupby(['PersonID', 'Name', 'RuleID'])
                         ['RuleNumber'].transform('idxmin')), 'MAIN'))

OR

df['Label'] = (df['Label'].where((df.index != df.groupby(['PersonID', 'Name', 'RuleID'])
                         ['RuleNumber'].transform('idxmin')), 'MAIN'))
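One behavioral difference that may matter when choosing between the question's transform('min') comparison and the idxmin variants is how ties are handled. The sketch below assumes a hypothetical group in which two rows share the lowest RuleNumber; the names keys, grouped, by_min, and by_idxmin are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical single group where the minimum RuleNumber (3) appears twice.
df = pd.DataFrame({
    'PersonID': [1, 1, 1],
    'Name': ['Jan', 'Jan', 'Jan'],
    'Label': ['REL', 'REL', 'REL'],
    'RuleID': [55, 55, 55],
    'RuleNumber': [3, 3, 5],
})

keys = ['PersonID', 'Name', 'RuleID']
grouped = df.groupby(keys)['RuleNumber']

# Comparing against the broadcast group minimum flags every tied row.
by_min = np.where(df['RuleNumber'] == grouped.transform('min'),
                  'MAIN', df['Label'])

# idxmin reports only the first occurrence of the minimum in each group.
by_idxmin = df['Label'].copy()
by_idxmin.loc[grouped.idxmin()] = 'MAIN'

print(by_min.tolist())     # two rows labelled MAIN
print(by_idxmin.tolist())  # one row labelled MAIN
```

If the data can contain ties, which behavior is wanted depends on whether exactly one MAIN row per group is required.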

Answer 2

import pandas as pd

df = pd.DataFrame({'PersonID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Name': ["Jan", "Jan", "Jan", "Don", "Don", "Don", "Joe", "Joe", "Joe"],
                   'Label': ["REL", "REL", "REL", "REL", "REL", "REL", "REL", "REL", "REL"],
                   'RuleID': [55, 55, 55, 3, 3, 3, 10, 10, 10],
                   'RuleNumber': [3, 4, 5, 1, 2, 3, 234, 567, 999]})

# Group by Name only, find the index of the lowest RuleNumber in each group,
# and set Label to MAIN on those rows.
df.loc[df.groupby('Name')['RuleNumber'].idxmin(), 'Label'] = 'MAIN'

Answer 3

Use duplicated on PersonID:

df.loc[~df['PersonID'].duplicated(),'Label'] = 'MAIN'
print(df)

Output:

   PersonID Name Label  RuleID  RuleNumber
0         1  Jan  MAIN      55           3
1         1  Jan   REL      55           4
2         1  Jan   REL      55           5
3         2  Don  MAIN       3           1
4         2  Don   REL       3           2
5         2  Don   REL       3           3
6         3  Joe  MAIN      10         234
7         3  Joe   REL      10         567
8         3  Joe   REL      10         999
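Note that duplicated marks the first row seen for each PersonID, so the result above matches the expected output because the sample data already lists the lowest RuleNumber first within each person. A minimal sketch for data that is not pre-sorted, assuming a hypothetical shuffled copy of the question's data and sorting before the duplicated check (ordered and first_rows are names introduced only for this example):

```python
import pandas as pd

# Hypothetical shuffled version of the question's data: the lowest
# RuleNumber is no longer the first row of each person.
df = pd.DataFrame({
    'PersonID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Name': ["Jan", "Jan", "Jan", "Don", "Don", "Don", "Joe", "Joe", "Joe"],
    'Label': ["REL"] * 9,
    'RuleID': [55, 55, 55, 3, 3, 3, 10, 10, 10],
    'RuleNumber': [5, 3, 4, 2, 3, 1, 567, 999, 234],
})

# Sort so that the lowest RuleNumber comes first within each PersonID.
ordered = df.sort_values(['PersonID', 'RuleNumber'])

# In the sorted view, the first (non-duplicated) row per PersonID is the minimum.
first_rows = ordered.loc[~ordered['PersonID'].duplicated()].index

# Apply the label on the original, unsorted dataframe via the index labels.
df.loc[first_rows, 'Label'] = 'MAIN'

main_numbers = df.loc[df['Label'] == 'MAIN', 'RuleNumber'].tolist()
print(df)
```

Like the original one-liner, this sketch keys on PersonID alone rather than on the PersonID, Name, and RuleID combination from the question.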

3 solutions

Solution 1
2 2020-12-17 20:14:22

Solution 2
2 2020-12-17 20:25:40

Solution 3
0 2020-12-17 20:45:15
