[英]pandas: Split separated values in a DataFrame column (one Series) into multiple Columns. Elegant solutions?
I have a column in a DataFrame (which is a column in a csv) which are comma-separated values. 我在DataFrame中有一个列(它是csv中的一列),它们是以逗号分隔的值。 I'd like to split this column into multiple columns.
我想将此列拆分为多个列。
The problem is an old one, and has been discussed here also, but there is one peculiarity: one entry may be from 0-n
comma-separated values. 问题是一个旧的问题,这里也讨论过,但有一个特点:一个条目可能来自
0-n
逗号分隔值。 An example: 一个例子:
df.head():
i: vals | sth_else
---------------------
1: a,b,c | ba
2: a,d | be
3: | bi
4: e,a,c | bo
5: e | bu
I'd like the following output (or similar, eg True/False): 我想要以下输出(或类似,例如True / False):
i : a | b | c | d | e | sth_else
-----------------------------------
1: 1 | 1 | 1 | 0 | 0 | ba
2: 1 | 0 | 0 | 1 | 0 | be
3: 0 | 0 | 0 | 0 | 0 | bi
4: 1 | 0 | 1 | 0 | 1 | bo
5: 0 | 0 | 0 | 0 | 1 | bu
I'm currently experimenting with the Series.str.split
and then Series.to_dict
functions, but with out any satisfactory results (causing always a ValueError: arrays must all be same length
. :) 我目前正在尝试使用
Series.str.split
和Series.to_dict
函数,但没有任何令人满意的结果(总是导致ValueError: arrays must all be same length
。:)
Also, I always try to find elegant solutions which are easily understandable when looked at after a couple of months ;). 此外,我总是试图找到优雅的解决方案,这些解决方案在几个月后查看时很容易理解;)。 In any case, propositions are highly appreciated!
在任何情况下,命题都非常感谢!
Here is the dummy.csv
for testing. 这是用于测试的
dummy.csv
。
vals;sth_else
a,b,c;ba
a,d;be
;bi
e,a,c;bo
e;bu
import pandas as pd
from StringIO import StringIO # py2.7 used here
# from io.StringIO import StringIO if you have py3.x
# data
# ==================================================================
csv_buffer = 'vals;sth_else\na,b,c;ba\na,d;be\n;bi\ne,a,c;bo\ne;bu'
df = pd.read_csv(StringIO(csv_buffer), sep=';')
Out[58]:
vals sth_else
0 a,b,c ba
1 a,d be
2 NaN bi
3 e,a,c bo
4 e bu
# processing
# ==================================================================
def func(group):
return pd.Series(group.vals.str.split(',').values[0], name='vals')
ser = df.groupby(level=0).apply(func)
Out[60]:
0 0 a
1 b
2 c
1 0 a
1 d
2 0 NaN
3 0 e
1 a
2 c
4 0 e
Name: vals, dtype: object
# use get_dummies, and then aggregate for each column of a b c d e to be its max (max is always 1 in this case)
pd.get_dummies(ser)
Out[85]:
a b c d e
0 0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
1 0 1 0 0 0 0
1 0 0 0 1 0
2 0 0 0 0 0 0
3 0 0 0 0 0 1
1 1 0 0 0 0
2 0 0 1 0 0
4 0 0 0 0 0 1
# do this groupby on outer index level [0,1,2,3,4] and reduce any inner group from multiple rows to one row
df_dummies = pd.get_dummies(ser).groupby(level=0).apply(lambda group: group.max())
Out[64]:
a b c d e
0 1 1 1 0 0
1 1 0 0 1 0
2 0 0 0 0 0
3 1 0 1 0 1
4 0 0 0 0 1
df_dummies['sth_else'] = df.sth_else
Out[67]:
a b c d e sth_else
0 1 1 1 0 0 ba
1 1 0 0 1 0 be
2 0 0 0 0 0 bi
3 1 0 1 0 1 bo
4 0 0 0 0 1 bu
This is very similar to another question today. 这与今天的另一个问题非常相似。 As I said in that question, there may be a simple elegant pandas way to do this, but I also find it convenient to simply create a new data frame and populate it by iterating over the original one in the following fashion:
正如我在那个问题中所说,可能有一个简单优雅的pandas方法来做到这一点,但我也发现简单地创建一个新的数据框并通过以下列方式迭代原始数据框来填充它是很方便的:
#import and create your data
import pandas as pd
DF = pd.DataFrame({ 'vals' : ['a,b,c', 'a,d', '', 'e,a,c', 'e'],
'other' : ['ba', 'be', 'bi', 'bo', 'bu']
}, dtype = str)
Now create a new data frame with the other
column form the DF
as the index and columns that are drawn from the unique characters found in your val
column in the DF
: 现在创建一个新数据框,
other
列形成DF
作为索引,以及从DF
val
列中找到的唯一字符中提取的列:
New_DF = pd.DataFrame({col : 0 for col in
set([letter for letter in ''.join([char for char in DF.vals.values])
if letter.isalpha()])},
index = DF.other)
In [51]: New_DF
Out[51]:
a b c d e
other
ba 0 0 0 0 0
be 0 0 0 0 0
bi 0 0 0 0 0
bo 0 0 0 0 0
bu 0 0 0 0 0
Now simply iterate over the index of the New_DF
slicing the original DF
at that value and iterate over the columns to see if they appear in the relevant_string
: 现在,只需遍历的索引
New_DF
切原DF
在该值和遍历列,看看他们是否出现在relevant_string
:
for ind in New_DF.index:
relevant_string = str(DF[DF.other == ind].vals.values)
for col in list(New_DF.columns):
if col in relevant_string:
New_DF.loc[ind, col] += 1
Output looks like this 输出看起来像这样
In [54]: New_DF
Out[54]:
a b c d e
other
ba 1 1 1 0 0
be 1 0 0 1 0
bi 0 0 0 0 0
bo 1 0 1 0 1
bu 0 0 0 0 1
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.