简体   繁体   中英

Label columns based on grouped data

I am trying to create a column that consists of unique values for each id (each id has numerous rows associated with it), if the id has the tag answered associated with any of its rows, all the rows associated with that id should be labelled answered. If all the rows associated with an id have an unanswered tag, all the rows should be labelled as unanswered (this is currently what happens)

This is code I have written:

import numpy as np

conds = [file.data__answered_at.isna(),file.data__answered_at.notna()]
choices = ["not answered","answered"]
file['call_status'] = np.select(conds,choices,default=np.nan)

 data__id   call_status       rank
  1            answered        1
  1          not_answered      2
  1            answered        3
  2          not_answered      1
  2             answered       2
  3          not_answered      1
  4            answered        1
  4          not_answered      2
  5          not_answered      1
  5          not_answered      2

In this case the desired outcome would be

   data__id   call_status       rank
  1            answered        1
  1            answered        2
  1            answered        3
  2            answered        1
  2            answered        2
  3          not_answered      1
  4            answered        1
  4            answered        2
  5          not_answered      1
  5          not_answered      2

Use GroupBy.transform with GroupBy.any for test at least one answered per groups and set values by DataFrame.loc :

mask = df['call_status'].eq('answered').groupby(df['data__id']).transform('any')

Or get all data__id filtered by another column and test membership by Series.isin :

mask = df['data__id'].isin(df.loc[df['call_status'].eq('answered'), 'data__id'].unique())

df.loc[mask, 'call_status'] = 'answered'
print (df)
   data__id   call_status  rank
0         1      answered     1
1         1      answered     2
2         1      answered     3
3         2      answered     1
4         2      answered     2
5         3  not_answered     1
6         4      answered     1
7         4      answered     2
8         5  not_answered     1
9         5  not_answered     2

We can use groupby here and check if any of the rows are equal to answered .

Then we use np.where to conditionally fill in answered or not_answered

m = file.groupby('data__id')['call_status'].transform(lambda x: x.eq('answered').any())

file['call_status'] = np.where(m, 'answered', 'not_answered')

Output

  data__id   call_status  rank
0         1      answered     1
1         1      answered     2
2         1      answered     3
3         2      answered     1
4         2      answered     2
5         3  not_answered     1
6         4      answered     1
7         4      answered     2
8         5  not_answered     1
9         5  not_answered     2

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM