
How to groupby and pivot a dataframe with non-numeric values

I'm using Python, and I have a dataset of 6 columns, R, Rc, J, T, Ca and Cb. I need to "aggregate" on the columns "R" then "J", so that for each R, each row is a unique "J". Rc is a characteristic of R. Ca and Cb are characteristics of T. It will make more sense looking at the table below.

I need to go from:

#______________________            ________________________________________________________________
#| R  Rc  J  T  Ca  Cb|           |# R  Rc  J  Ca(T=1)  Ca(T=2)  Ca(T=3)  Cb(T=1)  Cb(T=2)  Cb(T=3)|
#| a   p  1  1  x    d|           |# a  p   1    x         y        z        d        e        f   |
#| a   p  1  2  y    e|           |# b  o   1    w                           g                     |  
#| a   p  1  3  z    f|  ----->   |# b  o   2    v                           h                     | 
#| b   o  1  1  w    g|           |# b  o   3    s                           i                     |
#| b   o  2  1  v    h|           |# c  n   1    t         r                 j        k            |
#| b   o  3  1  s    i|           |# c  n   2    u                           l                     |
#| c   n  1  1  t    j|           |________________________________________________________________|
#| c   n  1  2  r    k|           
#| c   n  2  1  u    l|
#|____________________|

import pandas as pd

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], 
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2], 
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1], 
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}

df = pd.DataFrame(data=data)

I don't want to lose the data in Rc, Ca, or Cb.

Rc (or each column that ends in 'c') is the same for each R, so that can just be grouped with R.

But Ca and Cb (or each column that starts with 'C') are unique for each T, which will be aggregated and otherwise lost. These need to instead be saved in new columns named Ca(T=1) for when T=1, Ca(T=2) for when T=2, and Ca(T=3) for when T=3. The same goes for Cb.

So for each distinct value of T, I need one new column per Ca and Cb, holding the value that Ca or Cb has at that T.

PS. If it helps, columns J and T both have an extra column with unique IDs.

J_ID = [1,1,1,2,3,4,5,5,6]
T_ID = [1,2,3,4,5,6,7,8,9]

What I tried so far:

(
    df.groupby(['R','J'])
    .apply(lambda x: x.Ca.tolist()).apply(pd.Series)
    .rename(columns=lambda x: f'Ca{x+1}')
    .reset_index()
)

Problem: Only possible to do with one of the C's and I lose Rc.

Any help would be greatly appreciated!

You can use pivot_table (see the pandas docs) with a lambda function as the aggfunc argument:

table = pd.pivot_table(
    df,
    index=['R', 'Rc', 'J'],
    columns=['T'],
    values=['Ca', 'Cb'],
    fill_value='',
    aggfunc=lambda x: ''.join(str(v) for v in x),
).reset_index()


   R Rc  J Ca       Cb      
T           1  2  3  1  2  3
0  a  p  1  x  y  z  d  e  f
1  b  o  1  w        g      
2  b  o  2  v        h      
3  b  o  3  s        i      
4  c  n  1  t  r     j  k   
5  c  n  2  u        l      

Then you can flatten the MultiIndex columns and rename them as follows (taken from another answer):

table.columns = ['%s%s' % (a, ' (T = %s)' % b if b else '') for a, b in table.columns]

   R Rc  J Ca (T = 1) Ca (T = 2) Ca (T = 3) Cb (T = 1) Cb (T = 2) Cb (T = 3)
0  a  p  1          x          y          z          d          e          f
1  b  o  1          w                                g                      
2  b  o  2          v                                h                      
3  b  o  3          s                                i                      
4  c  n  1          t          r                     j          k           
5  c  n  2          u                                l                      
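
As an alternative to passing an aggfunc, here is a minimal sketch that reshapes with set_index and unstack. It assumes each (R, Rc, J, T) combination occurs only once, which holds for the sample data in the question; unstack raises an error on duplicates. The names wide, col and t are only illustrative.

```python
import pandas as pd

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2],
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1],
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}
df = pd.DataFrame(data)

# Move T from the row index into the columns. No aggregation is involved,
# so this relies on (R, Rc, J, T) being unique per row (an assumption).
wide = df.set_index(['R', 'Rc', 'J', 'T'])[['Ca', 'Cb']].unstack('T')

# Flatten the (column, T) MultiIndex into names such as 'Ca(T=1)'.
wide.columns = [f'{col}(T={t})' for col, t in wide.columns]

# Missing combinations come out as NaN; replace them with empty strings
# and turn R, Rc, J back into ordinary columns.
wide = wide.fillna('').reset_index()
print(wide)
```

Because nothing is aggregated here, a duplicate (R, Rc, J, T) row shows up as an error instead of being silently concatenated, which may or may not be what you want.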

If I understand what you need, you can simply locate the needed rows like this:

df['Ca(T=1)']=df['Ca'].loc[df['T']==1]

You have to repeat this for each value of T, and likewise for Cb.
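
A rough sketch of how that repetition could be written as a loop, assuming the column names from the question. Note that the assignment on its own leaves one row per original row, with NaN wherever T does not match, so this sketch adds a groupby(...).first() step to collapse to one row per (R, Rc, J). The names out and result are only illustrative.

```python
import pandas as pd

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2],
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1],
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}
df = pd.DataFrame(data)

out = df.copy()
for col in ['Ca', 'Cb']:
    for t in sorted(out['T'].unique()):
        # Keep the value only on rows where T == t; other rows become NaN.
        out[f'{col}(T={t})'] = out[col].where(out['T'] == t)

# Collapse to one row per (R, Rc, J). first() returns the first non-null
# value in each column, so the NaN placeholders are skipped.
result = (
    out.drop(columns=['T', 'Ca', 'Cb'])
       .groupby(['R', 'Rc', 'J'], as_index=False)
       .first()
)
print(result)
```

Combinations that never occur (for example T=2 for R='b') stay null in the result; append .fillna('') if you prefer empty strings.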
