How to split multiple columns in Pandas

Question

I have a data frame like the one below:

import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})
Out[248]: 
                var1              var2                var3
0    0,3788,99,20.88   0,929,92,299.90     8,9332,99,29.10
1  3,99022,08,91.995  1,38333,9,993.11  7,922111,07,45.443

I want to split each column on commas and place the new sets of columns next to each other. So the resulting data frame should look like this:

df2 = pd.DataFrame({('var1', 'x1'): [0, 3], ('var1', 'x2'): [3788, 99022], ('var1', 'x3'): [99, '08'], ('var1', 'x4'): [20.88, 91.995],
                    ('var2', 'x1'): [0, 1], ('var2', 'x2'): [929, 38333], ('var2', 'x3'): [92, 9], ('var2', 'x4'): [299.90, 993.11],
                    ('var3', 'x1'): [8, 7], ('var3', 'x2'): [9332, 922111], ('var3', 'x3'): [99, '07'], ('var3', 'x4'): [29.10, 45.443]})

Out[249]: 
  var1                    var2                    var3                    
    x1     x2  x3      x4   x1     x2  x3      x4   x1      x2  x3      x4
0    0   3788  99  20.880    0    929  92  299.90    8    9332  99  29.100
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443

The MultiIndex is not mandatory, but I'd like to be able to easily gather the data and obtain df3 if needed:

    var  x1      x2  x3       x4
0  var1   0    3788  99   20.880
1  var1   3   99022  08   91.995
0  var2   0     929  92  299.900
1  var2   1   38333   9  993.110
0  var3   8    9332  99   29.100
1  var3   7  922111  07   45.443

My attempt used pd.melt and str.split:

df_long = pd.melt(df.reset_index(drop = False), id_vars = 'index', var_name = 'var', value_name = 'values') \
    .sort_values(['index', 'var']) \
    .set_index('index')
df_long = df_long['values'].str.split(',', expand = True)
df_long.columns = ['x' + str(i) for i in range(df_long.shape[1])]

But: 1) I don't know how to then spread the data for the different variables (var1, var2, var3, ...) next to each other, and 2) transforming from wide format to long format (df to df_long) and back again (df_long to df3) seems highly inefficient, and I care about the performance of the solution.

So what's the best way to transform df into df2, so that we can then easily obtain df3 if needed?

Answer 1

You can use stack, str.split() with expand=True, and unstack() to achieve this:

final=(df.stack().str.split(',',expand=True).unstack().swaplevel(axis=1)
                                             .sort_index(level=0,axis=1))
print(final)

     var1                    var2                    var3                    
     0      1   2       3    0      1   2       3    0       1   2       3
0    0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443

To rename the inner level of the columns (the split positions), use:

final.columns=pd.MultiIndex.from_tuples([(a,f'x{b}') for a,b in final.columns])

   var1                    var2                    var3                       
    x0     x1  x2      x3   x0     x1  x2      x3   x0      x1  x2      x3
0    0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443
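
The question asks for inner labels x1 to x4, while the rename above yields x0 to x3. As a minimal sketch, assuming the inner level still holds the integer positions 0..3 produced by str.split, DataFrame.rename with level=1 can shift the numbering by one. The lambda and the one-based numbering are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

# Same pipeline as in the answer: long format, split, then back to wide.
final = (df.stack().str.split(',', expand=True).unstack()
           .swaplevel(axis=1).sort_index(level=0, axis=1))

# Rename only the inner column level; its labels are the integers 0..3 here,
# so adding 1 gives x1..x4 as requested in the question.
final = final.rename(columns=lambda i: f'x{i + 1}', level=1)

print(final.columns.tolist()[:4])
```

Passing level=1 leaves the outer labels (var1, var2, var3) untouched, so the columns do not have to be rebuilt from tuples.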

You can also use the following for the second output shown in your question:

df.stack().str.split(',',expand=True).add_prefix('x').reset_index(1).reset_index(drop=True)

  level_1 x0      x1  x2      x3
0    var1  0    3788  99   20.88
1    var2  0     929  92  299.90
2    var3  8    9332  99   29.10
3    var1  3   99022  08  91.995
4    var2  1   38333   9  993.11
5    var3  7  922111  07  45.443
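
The output above interleaves the variables and names the key column level_1, whereas df3 in the question groups the rows by variable, keeps the original row labels, and calls the column var. One possible variation, sketched here under the assumption that this layout is what is wanted, names the index level before resetting it and then sorts with a stable algorithm:

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

df3 = (df.stack()
         .str.split(',', expand=True)
         .add_prefix('x')
         .rename_axis(index=[None, 'var'])    # name the level that holds var1, var2, var3
         .reset_index(level='var')            # move it to a column, keep the row labels
         .sort_values('var', kind='stable'))  # group by variable, keep row order within each

print(df3)
```

The stable sort matters only for the row order inside each variable; without it the grouping is still correct but the 0, 1 order within a group is not guaranteed.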

Answer 2

Here is an approach that gets df3 first:

df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix("x").assign(var=s) for s in df])

print(df3)

  x0      x1  x2      x3   var
0  0    3788  99   20.88  var1
1  3   99022  08  91.995  var1
0  0     929  92  299.90  var2
1  1   38333   9  993.11  var2
0  8    9332  99   29.10  var3
1  7  922111  07  45.443  var3

And then:

df2 = df3.set_index("var", append=True).unstack().swaplevel(axis=1).sort_index(axis=1)

print(df2)

var var1                    var2                    var3                    
      x0     x1  x2      x3   x0     x1  x2      x3   x0      x1  x2      x3
0      0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1      3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443
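
The question also asks for an easy way to get from df2 back to df3. A rough sketch of that round trip, assuming df2 was built as above so that its outer column level is named var, is to stack that level back into the index. The name long_again is only illustrative, and depending on the pandas version stack may emit a FutureWarning about its implementation.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

# Build df3 and df2 exactly as in the answer.
df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix('x').assign(var=s) for s in df])
df2 = df3.set_index('var', append=True).unstack().swaplevel(axis=1).sort_index(axis=1)

# Move the outer column level ('var') back into the rows, then turn it into a column.
long_again = (df2.stack(level=0)
                 .reset_index(level='var')
                 .sort_values('var', kind='stable'))

print(long_again)
```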

Answer 3

Start by defining a function to reformat a single cell:

def refCell(cell, ind1):
    tbl = cell.split(',')
    ind2 = [ 'x' + str(i) for i in range(1, len(tbl) + 1) ]
    ind = pd.MultiIndex.from_product([[ind1], ind2])
    return pd.Series(tbl, index=ind)

It creates a Series whose values result from splitting a cell, with a MultiIndex where:

The first level is ind1.
The second level is x1, x2 and so on (a list of strings).
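
To make the two index levels concrete, here is a small illustration that calls refCell on a single cell taken from the question's data. The function is repeated so that the snippet runs on its own; the variable name s is only for illustration.

```python
import pandas as pd

def refCell(cell, ind1):
    tbl = cell.split(',')
    ind2 = ['x' + str(i) for i in range(1, len(tbl) + 1)]
    ind = pd.MultiIndex.from_product([[ind1], ind2])
    return pd.Series(tbl, index=ind)

# One cell from column var1: four string values under ('var1', 'x1') .. ('var1', 'x4').
s = refCell('0,3788,99,20.88', 'var1')
print(s)
```

The values stay strings, since the function only splits the text and does not convert types.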

The second function to define reformats a row:

def refRow(row):
    return pd.concat([ refCell(val, idx) for idx, val in row.items() ])

Then, to get the result, apply this function to each row:

df.apply(refRow, axis=1)
