简体   繁体   中英

Add a new column with values based on groupby values two other columns

I'm trying to add a new column to a dataframe, based on a groupby of the original dataframe, and assigning one of three values based on the results of that groupby

ie In the df below I want to add a fourth column to show whether I have too many 'A's, too many 'C's, or just the right amount.

    col1 col2  col3
0   1111    A     1
1   1111    B     3
2   1111    B     3
3   1111    B     3
4   1111    C     1
5   2222    A     1
6   2222    B     1
7   2222    C     2
8   2222    C     2
9   3333    A     2
10  3333    A     2
11  3333    B     1
12  3333    C     1

becomes...

    col1 col2  col3 col4
0   1111    A     1   OK
1   1111    B     3   OK
2   1111    B     3   OK
3   1111    B     3   OK
4   1111    C     1   OK
5   2222    A     1   >C
6   2222    B     1   >C
7   2222    C     2   >C
8   2222    C     2   >C
9   3333    A     2   >A
10  3333    A     2   >A
11  3333    B     1   >A
12  3333    C     1   >A

I was thinking of using transform, but that returns series, and I think in this case I need to check two different columns?

Example:

d1 = {'col1': ['1111', '1111', '1111', '1111', '1111', '2222', '2222', '2222', '2222', '3333', '3333', '3333', '3333'],
      'col2': ['A', 'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'C'],
      'col3': [1, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 1, 1]}
df1 = pd.DataFrame(data=d1)
d2 = {'col1': ['1111', '1111', '1111', '1111', '1111', '2222', '2222', '2222', '2222', '3333', '3333', '3333', '3333'],
      'col2': ['A', 'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'C'],
      'col3': [1, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 1, 1],
      'col4': ['OK', 'OK', 'OK', 'OK', 'OK', '>C', '>C', '>C', '>C', '>A', '>A', '>A', '>A']}
df2 = pd.DataFrame(data=d2)
print(df1)
print(df2)

From what I understand, try this:

Get counts using crosstab of col2 wrt col1 values then using .loc[] filter the values you want to consider, then using df.gt compare if count is greater than 1 and do a dot multiplication with column names, finally map it back with series.fillna

Solution:

values_to_filter = ['A','C'] #put B for testing and it will show >B for first group
m = pd.crosstab(df1['col1'],df1['col2']).loc[:,values_to_filter]
df1['col4'] = (df1['col1'].map(m.gt(1).dot(m.columns).replace('',np.nan)
                                             .radd('>')).fillna('OK'))

Output:

print(df1)

    col1 col2  col3 col4
0   1111    A     1   OK
1   1111    B     3   OK
2   1111    B     3   OK
3   1111    B     3   OK
4   1111    C     1   OK
5   2222    A     1   >C
6   2222    B     1   >C
7   2222    C     2   >C
8   2222    C     2   >C
9   3333    A     2   >A
10  3333    A     2   >A
11  3333    B     1   >A
12  3333    C     1   >A

Additional Details: Where m is a count of col2 values wrt col1 values:

print(m)

col2  A  C
col1      
1111  1  1
2222  1  2
3333  2  1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM