How to append new columns to a pandas groupby object from a list of values

Question

I want to write a script that takes the string values from a column, splits them on commas, and creates a new column for each of the resulting strings (filled with NaN for now). Since the df is grouped by Column1, I want to do this for every group.

My input data frame looks like this:

df1:
        Column1  Column2
    0   L17      a,b,c,d,e
    1   L7       a,b,c
    2   L6       a,b,f
    3   L6       h,d,e

What I finally want to have is:

       Column1  Column2     a   b   c   d   e   f   h
    0   L17      a,b,c,d,e  nan nan nan nan nan nan nan
    1   L7       a,b,c      nan nan nan nan nan nan nan
    2   L6       a,b,f      nan nan nan nan nan nan nan

My code currently looks like this:

def NewCols(x):
    for item, frame in group['Column2'].iteritems():
        Genes = frame.split(',')
        for value in Genes:
            string = value
            x[string] = np.nan
            return x

df1.groupby('Column1').apply(NewCols)

My thought behind this was that the code loops through Column2 of every grouped object, splitting the values contained in frame at the commas and creating a list for that group. So far the code works fine. Then I added

for value in Genes:
   string = value
   x[string] = np.nan
   return x

with the intention of adding a new column for every value contained in the list Genes. However, my output looks like this:

   Column1  Column2    d
0   L17      a,b,c,d,e nan
1   L7       a,b,c     nan
2   L6       a,b,f     nan
3   L6       h,d,e     nan

and I am pretty much stumped. Can someone explain why only one column gets appended (which is not even named after the first value in the first list of the first group) and suggest how I could improve my code?

Answer 1

I think you just return too early in your function, before the end of the two loops. If you dedent the return by two levels, like this:

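import numpy as np

def NewCols(x):
    # x is the sub-DataFrame that apply passes in for one group
    # (the original looped over an outer name, group, instead of x)
    for item, frame in x['Column2'].items():
        Genes = frame.split(',')
        for value in Genes:
            x[value] = np.nan
    # return only once both loops have finished
    return x

df1.groupby('Column1').apply(NewCols)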

It should work fine!
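
To check the effect of moving the return on the sample data, here is a minimal self-contained sketch. It rebuilds df1 from the question and reads from x, the group frame, which is an assumption about what the loop over group was meant to do. The name new_cols is only illustrative. It also concatenates the per-group frames explicitly instead of calling apply, because the way apply treats the grouping column differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Column1": ["L17", "L7", "L6", "L6"],
    "Column2": ["a,b,c,d,e", "a,b,c", "a,b,f", "h,d,e"],
})

def new_cols(x):
    """Add one NaN-filled column per comma-separated value in this group."""
    x = x.copy()  # work on a copy so the caller's frame is not modified
    for _, genes in x["Column2"].items():
        for value in genes.split(","):
            x[value] = np.nan
    # The return sits outside both loops, so every value gets a column.
    return x

# Process each group separately, then stack the pieces back together.
# Columns missing from a group are filled with NaN by concat.
pieces = [new_cols(group) for _, group in df1.groupby("Column1")]
result = pd.concat(pieces).sort_index()
print(result)
```

Each group only receives columns for its own values; the full set a to h appears once the groups are combined.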

Answer 2

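# all distinct values that occur in Column2, sorted
cols = sorted(set(df1['Column2'].apply(lambda x: x.split(',')).sum()))
# one row per Column1 value, with that group's Column2 strings joined
df = df1.groupby('Column1').agg(lambda x: ','.join(x)).reset_index()
# append one NaN-filled column per distinct value
pd.concat([df, pd.DataFrame({c: np.nan for c in cols}, index=df.index)], axis=1)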

    Column1 Column2     a   b   c   d   e   f   h
0   L17     a,b,c,d,e   NaN NaN NaN NaN NaN NaN NaN
1   L6      a,b,f,h,d,e NaN NaN NaN NaN NaN NaN NaN
2   L7      a,b,c       NaN NaN NaN NaN NaN NaN NaN
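
Note that the output above merges the two L6 rows into one. If one row per original record is wanted instead, a possible variant (a sketch, not part of the answer) is to skip the groupby and let DataFrame.reindex create the missing columns, which it fills with NaN. The frame below is rebuilt from the question's sample data.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Column1": ["L17", "L7", "L6", "L6"],
    "Column2": ["a,b,c,d,e", "a,b,c", "a,b,f", "h,d,e"],
})

# Join every Column2 string into one, split it again, and keep
# the distinct values in sorted order.
cols = sorted(set(",".join(df1["Column2"]).split(",")))

# reindex keeps the existing columns and adds the new ones as NaN.
out = df1.reindex(columns=[*df1.columns, *cols])
print(out)
```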

2 answers

Answer 1: score 2, accepted, 2015-10-15 13:58:49

Answer 2: score 1, 2015-10-15 16:42:42
