pandas: Split separated values in a DataFrame column (one Series) into multiple Columns. Elegant solutions?

I have a column in a DataFrame (read from a column in a csv) that contains comma-separated values. I'd like to split this column into multiple columns.

The problem is an old one and has been discussed here before, but there is one peculiarity: each entry may contain anywhere from 0 to n comma-separated values. An example:

df.head():

i: vals   | sth_else 
---------------------
1: a,b,c  | ba
2: a,d    | be
3:        | bi
4: e,a,c  | bo
5: e      | bu

I'd like the following output (or similar, e.g. True/False):

i : a | b | c | d | e |  sth_else 
-----------------------------------
1:  1 | 1 | 1 | 0 | 0 | ba
2:  1 | 0 | 0 | 1 | 0 | be
3:  0 | 0 | 0 | 0 | 0 | bi
4:  1 | 0 | 1 | 0 | 1 | bo
5:  0 | 0 | 0 | 0 | 1 | bu

I'm currently experimenting with Series.str.split followed by Series.to_dict, but without any satisfactory results (it always raises ValueError: arrays must all be same length :).

Also, I always try to find elegant solutions that are still easy to understand when I look at them again after a couple of months ;). In any case, suggestions are highly appreciated!

Here is the dummy.csv for testing.

vals;sth_else 
a,b,c;ba
a,d;be
;bi
e,a,c;bo
e;bu
import pandas as pd
from io import StringIO  # Python 3; the original answer ran on py2.7 with: from StringIO import StringIO

# data
# ==================================================================
csv_buffer = 'vals;sth_else\na,b,c;ba\na,d;be\n;bi\ne,a,c;bo\ne;bu'

df = pd.read_csv(StringIO(csv_buffer), sep=';')

Out[58]: 
    vals sth_else
0  a,b,c       ba
1    a,d       be
2    NaN       bi
3  e,a,c       bo
4      e       bu

# processing
# ==================================================================
def func(group):
    return pd.Series(group.vals.str.split(',').values[0], name='vals')

ser = df.groupby(level=0).apply(func)

Out[60]: 
0  0      a
   1      b
   2      c
1  0      a
   1      d
2  0    NaN
3  0      e
   1      a
   2      c
4  0      e
Name: vals, dtype: object
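
The long-form Series above can also be produced without a custom groupby function. The following is a minimal sketch, not part of the original answer, assuming a pandas version that provides Series.explode (0.25 or later); the name exploded is only illustrative. The result repeats the original row label instead of building a two-level index, so a later groupby(level=0) should still work on it.

```python
import pandas as pd
from io import StringIO

csv_buffer = 'vals;sth_else\na,b,c;ba\na,d;be\n;bi\ne,a,c;bo\ne;bu'
df = pd.read_csv(StringIO(csv_buffer), sep=';')

# split each string into a list, then give every list element its own row;
# the original row label is repeated, and the missing entry stays a single NaN
exploded = df['vals'].str.split(',').explode()
print(exploded)
```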


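# one-hot encode every value with get_dummies; each column a b c d e is later aggregated by its max (always 1 here)
# dtype=int keeps the 0/1 output on newer pandas, where get_dummies returns booleans by default
pd.get_dummies(ser, dtype=int)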

Out[85]: 
     a  b  c  d  e
0 0  1  0  0  0  0
  1  0  1  0  0  0
  2  0  0  1  0  0
1 0  1  0  0  0  0
  1  0  0  0  1  0
2 0  0  0  0  0  0
3 0  0  0  0  0  1
  1  1  0  0  0  0
  2  0  0  1  0  0
4 0  0  0  0  0  1

# group on the outer index level [0, 1, 2, 3, 4] and reduce each group from multiple rows to one row via max
df_dummies = pd.get_dummies(ser, dtype=int).groupby(level=0).max()

Out[64]: 
   a  b  c  d  e
0  1  1  1  0  0
1  1  0  0  1  0
2  0  0  0  0  0
3  1  0  1  0  1
4  0  0  0  0  1


df_dummies['sth_else'] = df.sth_else

Out[67]: 
   a  b  c  d  e sth_else
0  1  1  1  0  0       ba
1  1  0  0  1  0       be
2  0  0  0  0  0       bi
3  1  0  1  0  1       bo
4  0  0  0  0  1       bu
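
For comparison, pandas also ships a string method that does the split and the one-hot encoding in a single call. This is a hedged sketch rather than the method of the answer above; the names indicators and result are illustrative only.

```python
import pandas as pd
from io import StringIO

csv_buffer = 'vals;sth_else\na,b,c;ba\na,d;be\n;bi\ne,a,c;bo\ne;bu'
df = pd.read_csv(StringIO(csv_buffer), sep=';')

# Series.str.get_dummies splits on the separator and builds indicator columns;
# a missing entry (NaN) becomes a row of zeros
indicators = df['vals'].str.get_dummies(sep=',')

# put the indicator columns next to the remaining original columns
result = indicators.join(df.drop(columns='vals'))
print(result)
```

The row labels are kept from df, so the join lines up without any extra bookkeeping.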

This is very similar to another question asked today. As I said there, there may be a simple, elegant pandas way to do this, but I also find it convenient to create a new data frame and populate it by iterating over the original one, as follows:

#import and create your data
import pandas as pd
DF = pd.DataFrame({ 'vals'  : ['a,b,c', 'a,d', '', 'e,a,c', 'e'],
                    'other' : ['ba', 'be', 'bi', 'bo', 'bu'] 
                  }, dtype = str)

Now create a new data frame with the other column from DF as the index, and with columns drawn from the unique characters found in the vals column of DF:

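# collect the unique letters in the vals column; sorted() gives a stable column order
letters = sorted({letter for letter in ''.join(DF.vals) if letter.isalpha()})
# start with all zeros, one row per value of the other column
New_DF = pd.DataFrame(0, index=DF.other, columns=letters)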

In [51]: New_DF
Out[51]: 
       a  b  c  d  e
other               
ba     0  0  0  0  0
be     0  0  0  0  0
bi     0  0  0  0  0
bo     0  0  0  0  0
bu     0  0  0  0  0

Now iterate over the index of New_DF, slice the original DF at each index value, and iterate over the columns to check whether each column name appears in relevant_string:

for ind in New_DF.index:
    relevant_string = str(DF[DF.other == ind].vals.values)
    for col in list(New_DF.columns):
        if col in relevant_string:
            New_DF.loc[ind, col] += 1

The output looks like this:

In [54]: New_DF
Out[54]: 
       a  b  c  d  e
other               
ba     1  1  1  0  0
be     1  0  0  1  0
bi     0  0  0  0  0
bo     1  0  1  0  1
bu     0  0  0  0  1
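
Note that the loop above tests substrings, which is fine for single-letter values but would also match 'a' inside a longer value such as 'ab'. A minimal sketch of the same iterate-and-fill idea that compares whole tokens instead, assuming the values never contain commas themselves; the names tokens and columns are illustrative and not from the answer:

```python
import pandas as pd

DF = pd.DataFrame({'vals': ['a,b,c', 'a,d', '', 'e,a,c', 'e'],
                   'other': ['ba', 'be', 'bi', 'bo', 'bu']}, dtype=str)

# split every entry into whole tokens; the empty string yields no tokens
tokens = [set(filter(None, entry.split(','))) for entry in DF['vals']]

# the columns are all distinct tokens, sorted for a stable order
columns = sorted(set().union(*tokens))

# one row per original row: 1 if the token is present, else 0
New_DF = pd.DataFrame(
    [[int(col in row_tokens) for col in columns] for row_tokens in tokens],
    index=DF['other'], columns=columns)
print(New_DF)
```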
