
How to append new columns to a pandas groupby object from a list of values

I want to write a script that takes the string values from a column, splits them on commas, and creates a new column for each of the resulting substrings (filled with NaN for now). Since the df is grouped by Column1, I want to do this for every group.

My input data frame looks like this:

df1:
      Column1 Column2   
    0   L17      a,b,c,d,e
    1   L7       a,b,c
    2   L6       a,b,f
    3   L6       h,d,e

What I finally want to have is:

       Column1  Column2     a   b   c   d   e   f   h
    0   L17      a,b,c,d,e  nan nan nan nan nan nan nan
    1   L7       a,b,c      nan nan nan nan nan nan nan
    2   L6       a,b,f      nan nan nan nan nan nan nan

My code currently looks like this:

def NewCols(x):
    for item, frame in group['Column2'].iteritems():
        Genes = frame.split(',')
        for value in Genes:
            string = value
            x[string] = np.nan
            return x

df1.groupby('Column1').apply(NewCols)

My thought behind this was that the code loops through Column2 of every grouped object, splitting the values contained in frame at comma and creating a list for that group. So far the code works fine. Then I added

for value in Genes:
   string = value
   x[string] = np.nan
   return x

with the intention of adding a new column for every value contained in the list Genes. However, my output looks like this:

   Column1  Column2    d
0   L17      a,b,c,d,e nan
1   L7       a,b,c     nan
2   L6       a,b,f     nan
3   L6       h,d,e     nan

and I am pretty much struck dumb. Can someone explain why only one column gets appended (which is not even named after the first value in the first list of the first group) and suggest how I could improve my code?

I think you simply return too early in your function, before the two loops have finished. If you dedent the return statement by two levels, like this:

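import numpy as np
import pandas as pd

def NewCols(x):
    # x is the sub-DataFrame that groupby passes in for one group
    # (.items() replaces .iteritems(), which was removed in pandas 2.0)
    for item, frame in x['Column2'].items():
        Genes = frame.split(',')
        for value in Genes:
            x[value] = np.nan
    # return only after both loops have finished
    return x

df1.groupby('Column1').apply(NewCols)

It should work fine!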
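
To see the effect of the indentation in isolation, here is a minimal sketch that leaves groupby out and calls both variants directly on the example frame. The function names new_cols_early_return and new_cols are made up for this illustration, and each function works on a copy so that df1 itself is not modified.

```python
import numpy as np
import pandas as pd

# Example frame from the question
df1 = pd.DataFrame({
    "Column1": ["L17", "L7", "L6", "L6"],
    "Column2": ["a,b,c,d,e", "a,b,c", "a,b,f", "h,d,e"],
})

def new_cols_early_return(x):
    """Hypothetical helper: same logic as the question, return inside the inner loop."""
    x = x.copy()
    for _, cell in x["Column2"].items():
        for value in cell.split(","):
            x[value] = np.nan
            return x  # exits after the very first value of the first row
    return x

def new_cols(x):
    """Hypothetical helper: return moved out of both loops."""
    x = x.copy()
    for _, cell in x["Column2"].items():
        for value in cell.split(","):
            x[value] = np.nan
    return x

early = new_cols_early_return(df1)
full = new_cols(df1)

print(list(early.columns))  # ['Column1', 'Column2', 'a']
print(list(full.columns))   # ['Column1', 'Column2', 'a', 'b', 'c', 'd', 'e', 'f', 'h']
```

With the early return only one column is ever added, which matches the symptom described in the question.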

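# Unique, sorted values across all rows of Column2
cols = sorted(set(df1['Column2'].apply(lambda x: x.split(',')).sum()))
# One row per Column1 group, with that group's Column2 strings joined by commas
df = df1.groupby('Column1').agg(lambda x: ','.join(x)).reset_index()
# Append one NaN-filled column per unique value
pd.concat([df, pd.DataFrame({c: np.nan for c in cols}, index=df.index)], axis=1)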

    Column1 Column2     a   b   c   d   e   f   h
0   L17     a,b,c,d,e   NaN NaN NaN NaN NaN NaN NaN
1   L6      a,b,f,h,d,e NaN NaN NaN NaN NaN NaN NaN
2   L7      a,b,c       NaN NaN NaN NaN NaN NaN NaN
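
If the original four rows should be kept rather than merged per Column1 value, a variant without groupby might look like the sketch below. It is only one possible approach, not the method from either answer: it assumes a pandas version that has Series.explode (0.25 or later) and uses reindex to add the missing columns, which come out filled with NaN. The names tokens, new_columns and result are made up for the example.

```python
import pandas as pd

# Example frame from the question
df1 = pd.DataFrame({
    "Column1": ["L17", "L7", "L6", "L6"],
    "Column2": ["a,b,c,d,e", "a,b,c", "a,b,f", "h,d,e"],
})

# Split every cell on commas, flatten to one value per row, keep the unique ones
tokens = sorted(df1["Column2"].str.split(",").explode().unique())

# Existing columns first, then one new column per value not already present
new_columns = list(df1.columns) + [t for t in tokens if t not in df1.columns]

# reindex creates the columns that do not exist yet and fills them with NaN
result = df1.reindex(columns=new_columns)
print(result)
```

Because nothing is aggregated here, both L6 rows stay separate, unlike in the grouped output above.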
