简体   繁体   中英

Find the minimum value in a Pandas dataframe and add a label on new column

What improvements can I make to my python pandas code to make it more efficient? For my case, I have this dataframe

In [1]: df = pd.DataFrame({'PersonID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                           'Name': ["Jan", "Jan", "Jan", "Don", "Don", "Don", "Joe", "Joe", "Joe"],
                           'Label': ["REL", "REL", "REL", "REL", "REL", "REL", "REL", "REL", "REL"],
                           'RuleID': [55, 55, 55, 3, 3, 3, 10, 10, 10],
                           'RuleNumber': [3, 4, 5, 1, 2, 3, 234, 567, 999]})

Which gives this result:

In [2]: df
Out[2]: 
   PersonID Name Label  RuleID  RuleNumber
0         1  Jan   REL      55          3
1         1  Jan   REL      55          4
2         1  Jan   REL      55          5
3         2  Don   REL       3          1
4         2  Don   REL       3          2
5         2  Don   REL       3          3
6         3  Joe   REL      10        234
7         3  Joe   REL      10        567
8         3  Joe   REL      10        999

What I need to accomplished here is to update the fields under the Label column to MAIN for the lowest rule value associated with each Rule ID that is applied to a Person ID and Name. Therefore, the results need to look like this:

In [3]: df
Out[3]:
   PersonID Name Label  RuleID  RuleNumber
0         1  Jan  MAIN      55           3
1         1  Jan   REL      55           4
2         1  Jan   REL      55           5
3         2  Don  MAIN       3           1
4         2  Don   REL       3           2
5         2  Don   REL       3           3
6         3  Joe  MAIN      10         234
7         3  Joe   REL      10         567
8         3  Joe   REL      10         999

This is the code that I wrote to accomplish this:

In [4]:

df['Label'] = np.where(
        df['RuleNumber'] ==
        df.groupby(['PersonID', 'Name', 'RuleID'])['RuleNumber'].transform('min'),
        "MAIN", df.Label)

Is there a better way to update the values under the Label column? I feel like I'm brute forcing my way through and this may not be the most efficient way to do this.

I used the following SO threads to arrive at my result:

Replace column values within a groupby and condition

Replace values within a groupby based on multiple conditions

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmin.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html

Using Pandas to Find Minimum Values of Grouped Rows

Any advice would be appreciated.

Thank you.

It seems like you can filter by the grouped idxmin regardless of sorted order and update RuleNumber based on that. You can use loc , np.where , mask , or where as follows:

df.loc[df.groupby(['PersonID', 'Name', 'RuleID'])['RuleNumber'].idxmin(), 'Label'] = 'MAIN'

OR with np.where as you were trying:

df['Label'] = (np.where((df.index == df.groupby(['PersonID', 'Name', 'RuleID'])
                         ['RuleNumber'].transform('idxmin')), 'MAIN', 'REL'))
df
Out[1]: 
   PersonID Name Label  RuleID  RuleNumber
0         1  Jan  MAIN      55           3
1         1  Jan   REL      55           4
2         1  Jan   REL      55           5
3         2  Don  MAIN       3           1
4         2  Don   REL       3           2
5         2  Don   REL       3           3
6         3  Joe  MAIN      10         234
7         3  Joe   REL      10         567
8         3  Joe   REL      10         999

Using mask or its inverse where would also work:

df['Label'] = (df['Label'].mask((df.index == df.groupby(['PersonID', 'Name', 'RuleID'])
                         ['RuleNumber'].transform('idxmin')), 'MAIN'))

OR

df['Label'] = (df['Label'].where((df.index != df.groupby(['PersonID', 'Name', 'RuleID'])
                         ['RuleNumber'].transform('idxmin')), 'MAIN'))
import pandas as pd

df = pd.DataFrame({'PersonID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'Name': ["Jan", "Jan", "Jan", "Don", "Don", "Don", "Joe", "Joe", "Joe"],
'Label': ["REL", "REL", "REL", "REL", "REL", "REL", "REL", "REL", "REL"],
'RuleID': [55, 55, 55, 3, 3, 3, 10, 10, 10],
'RuleNumber': [3, 4, 5, 1, 2, 3, 234, 567, 999]})

df.loc[df.groupby('Name')['RuleNumber'].idxmin()[:], 'Label'] = 'MAIN'

Use duplicated on PersonID:

df.loc[~df['PersonID'].duplicated(),'Label'] = 'MAIN'
print(df)

Output:

   PersonID Name Label  RuleID  RuleNumber
0         1  Jan  MAIN      55           3
1         1  Jan   REL      55           4
2         1  Jan   REL      55           5
3         2  Don  MAIN       3           1
4         2  Don   REL       3           2
5         2  Don   REL       3           3
6         3  Joe  MAIN      10         234
7         3  Joe   REL      10         567
8         3  Joe   REL      10         999

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM