pandas: Split delimited values in a DataFrame column (a Series) into multiple columns. Elegant solution

Question

I have a column in a DataFrame (read from a column in a CSV file) that contains comma-separated values. I'd like to split this column into multiple columns.

The problem is an old one and has been discussed here before, but there is one peculiarity: an entry may contain anywhere from 0 to n comma-separated values. An example:

df.head():

i: vals   | sth_else 
---------------------
1: a,b,c  | ba
2: a,d    | be
3:        | bi
4: e,a,c  | bo
5: e      | bu

I'd like the following output (or similar, e.g. True/False):

i : a | b | c | d | e |  sth_else 
-----------------------------------
1:  1 | 1 | 1 | 0 | 0 | ba
2:  1 | 0 | 0 | 1 | 0 | be
3:  0 | 0 | 0 | 0 | 0 | bi
4:  1 | 0 | 1 | 0 | 1 | bo
5:  0 | 0 | 0 | 0 | 1 | bu

I'm currently experimenting with Series.str.split followed by Series.to_dict, but without any satisfactory results (it always raises ValueError: arrays must all be same length). :)

Also, I always try to find elegant solutions that are still easy to understand when I look at them again after a couple of months ;). In any case, suggestions are highly appreciated!

Here is the dummy.csv for testing.

vals;sth_else 
a,b,c;ba
a,d;be
;bi
e,a,c;bo
e;bu

Answer 1

import pandas as pd
from io import StringIO  # Python 3; on Python 2.7 use: from StringIO import StringIO

# data
# ==================================================================
csv_buffer = 'vals;sth_else\na,b,c;ba\na,d;be\n;bi\ne,a,c;bo\ne;bu'

df = pd.read_csv(StringIO(csv_buffer), sep=';')

Out[58]: 
    vals sth_else
0  a,b,c       ba
1    a,d       be
2    NaN       bi
3  e,a,c       bo
4      e       bu

# processing
# ==================================================================
def func(group):
    return pd.Series(group.vals.str.split(',').values[0], name='vals')

ser = df.groupby(level=0).apply(func)

Out[60]: 
0  0      a
   1      b
   2      c
1  0      a
   1      d
2  0    NaN
3  0      e
   1      a
   2      c
4  0      e
Name: vals, dtype: object
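
The split step above can also be sketched without a helper function, assuming a pandas version that provides Series.explode (0.25 or later). This is only an illustrative alternative, not the answer's original method; the name long_ser is made up for this example. The result has a flat index that repeats the original row label, rather than the two-level index shown above, so a later groupby(level=0) still groups by original row.

```python
import pandas as pd

# Same data as dummy.csv; the empty entry is represented as a missing value.
df = pd.DataFrame({'vals': ['a,b,c', 'a,d', None, 'e,a,c', 'e'],
                   'sth_else': ['ba', 'be', 'bi', 'bo', 'bu']})

# Split each entry into a list, then emit one row per list element.
# The original row label is repeated for every element of that row.
long_ser = df['vals'].str.split(',').explode()

print(long_ser)
```

The missing entry in row 2 stays a single missing value, so that row is not dropped.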


# use get_dummies, and then aggregate for each column of a b c d e to be its max (max is always 1 in this case)
pd.get_dummies(ser)

Out[85]: 
     a  b  c  d  e
0 0  1  0  0  0  0
  1  0  1  0  0  0
  2  0  0  1  0  0
1 0  1  0  0  0  0
  1  0  0  0  1  0
2 0  0  0  0  0  0
3 0  0  0  0  0  1
  1  1  0  0  0  0
  2  0  0  1  0  0
4 0  0  0  0  0  1

# do this groupby on outer index level [0,1,2,3,4] and reduce any inner group from multiple rows to one row
df_dummies = pd.get_dummies(ser).groupby(level=0).apply(lambda group: group.max())

Out[64]: 
   a  b  c  d  e
0  1  1  1  0  0
1  1  0  0  1  0
2  0  0  0  0  0
3  1  0  1  0  1
4  0  0  0  0  1


df_dummies['sth_else'] = df.sth_else

Out[67]: 
   a  b  c  d  e sth_else
0  1  1  1  0  0       ba
1  1  0  0  1  0       be
2  0  0  0  0  0       bi
3  1  0  1  0  1       bo
4  0  0  0  0  1       bu

Answer 2

This is very similar to another question asked today. As I said there, there may be a simple, elegant pandas way to do this, but I also find it convenient to create a new data frame and populate it by iterating over the original one, as follows:

#import and create your data
import pandas as pd
DF = pd.DataFrame({ 'vals'  : ['a,b,c', 'a,d', '', 'e,a,c', 'e'],
                    'other' : ['ba', 'be', 'bi', 'bo', 'bu'] 
                  }, dtype = str)

Now create a new data frame that uses the other column from DF as its index and takes its columns from the unique characters found in the vals column of DF:

New_DF = pd.DataFrame({col : 0 for col in 
                             set([letter for letter in ''.join([char for char in DF.vals.values]) 
                             if letter.isalpha()])},
                             index = DF.other)

In [51]: New_DF
Out[51]: 
       a  b  c  d  e
other               
ba     0  0  0  0  0
be     0  0  0  0  0
bi     0  0  0  0  0
bo     0  0  0  0  0
bu     0  0  0  0  0

Now iterate over the index of New_DF, slice the original DF at each value, and iterate over the columns to check whether each one appears in relevant_string:

for ind in New_DF.index:
    relevant_string = str(DF[DF.other == ind].vals.values)
    for col in list(New_DF.columns):
        if col in relevant_string:
            New_DF.loc[ind, col] += 1

The output looks like this:

In [54]: New_DF
Out[54]: 
       a  b  c  d  e
other               
ba     1  1  1  0  0
be     1  0  0  1  0
bi     0  0  0  0  0
bo     1  0  1  0  1
bu     0  0  0  0  1
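
Note that the loop above builds columns from single characters and tests membership with a substring check, so it assumes every value is one letter and that the other column has no duplicates. A minimal sketch that splits on the comma instead, and so should also cope with longer tokens, might look like this; the names token_sets and columns are made up for this example.

```python
import pandas as pd

DF = pd.DataFrame({'vals': ['a,b,c', 'a,d', '', 'e,a,c', 'e'],
                   'other': ['ba', 'be', 'bi', 'bo', 'bu']}, dtype=str)

# One set of tokens per row; filter(None, ...) drops the empty string
# that splitting an empty cell produces.
token_sets = [set(filter(None, cell.split(','))) for cell in DF['vals']]

# All distinct tokens, sorted so the column order is predictable.
columns = sorted(set().union(*token_sets))

# 1 if the token occurs in the row, else 0.
New_DF = pd.DataFrame(
    [[int(col in tokens) for col in columns] for tokens in token_sets],
    index=DF['other'], columns=columns)

print(New_DF)
```

Testing membership in a set of whole tokens avoids false matches such as 'a' being found inside a longer value like 'ab'.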

Answer 1: 3 (accepted), 2015-07-10 07:24:57

Answer 2: 1, 2015-07-10 07:23:59