How to split multiple columns in Pandas

I have a data frame like the one below:

import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})
Out[248]: 
                var1              var2                var3
0    0,3788,99,20.88   0,929,92,299.90     8,9332,99,29.10
1  3,99022,08,91.995  1,38333,9,993.11  7,922111,07,45.443

I want to split each column on the comma and place the new sets of columns next to each other. So the resulting data frame should look like this:

df2 = pd.DataFrame({('var1', 'x1'): [0, 3], ('var1', 'x2'): [3788, 99022], ('var1', 'x3'): [99, '08'], ('var1', 'x4'): [20.88, 91.995],
                    ('var2', 'x1'): [0, 1], ('var2', 'x2'): [929, 38333], ('var2', 'x3'): [92, 9], ('var2', 'x4'): [299.90, 993.11],
                    ('var3', 'x1'): [8, 7], ('var3', 'x2'): [9332, 922111], ('var3', 'x3'): [99, '07'], ('var3', 'x4'): [29.10, 45.443]})

Out[249]: 
  var1                    var2                    var3                    
    x1     x2  x3      x4   x1     x2  x3      x4   x1      x2  x3      x4
0    0   3788  99  20.880    0    929  92  299.90    8    9332  99  29.100
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443

The MultiIndex is not mandatory, but I'd like to be able to easily gather the data and obtain df3 if needed:

    var  x1      x2  x3       x4
0  var1   0    3788  99   20.880
1  var1   3   99022  08   91.995
0  var2   0     929  92  299.900
1  var2   1   38333   9  993.110
0  var3   8    9332  99   29.100
1  var3   7  922111  07   45.443

My attempt used pd.melt and str.split:

df_long = pd.melt(df.reset_index(drop = False), id_vars = 'index', var_name = 'var', value_name = 'values') \
    .sort_values(['index', 'var']) \
    .set_index('index')
df_long = df_long['values'].str.split(',', expand = True)
df_long.columns = ['x' + str(i) for i in range(df_long.shape[1])]

But: 1) I don't know how to then spread the data for var1, var2, var3, ... next to each other, and 2) transforming from wide format to long format (df to df_long) and back again (df_long to df3) seems highly inefficient, and I care about the performance of the solution.

So what's the best way to transform df into df2, so that we can then easily obtain df3 if needed?

You can use stack, str.split() with expand=True, and unstack() to achieve this:

final=(df.stack().str.split(',',expand=True).unstack().swaplevel(axis=1)
                                             .sort_index(level=0,axis=1))
print(final)

     var1                    var2                    var3                    
     0      1   2       3    0      1   2       3    0       1   2       3
0    0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443
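The one-liner above chains several reshaping steps. A step-by-step sketch may make the intermediate shapes easier to follow; the names stacked, parts and wide are only illustrative and do not appear in the answer:

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

# 1) stack: wide -> long Series indexed by (row, original column name)
stacked = df.stack()

# 2) split every string once; expand=True gives one column per piece (0..3)
parts = stacked.str.split(',', expand=True)

# 3) unstack: move the original column name from the row index into the
#    columns, giving a column MultiIndex of (piece, original column name)
wide = parts.unstack()

# 4) put the original column name on the outer level, then sort so that
#    the pieces of each variable sit next to each other
final = wide.swaplevel(axis=1).sort_index(level=0, axis=1)

print(stacked.shape, parts.shape, wide.shape, final.shape)
```

Note that every value in final is still a string, because str.split does not convert types.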

To rename the inner (second) level of the columns, use:

final.columns=pd.MultiIndex.from_tuples([(a,f'x{b}') for a,b in final.columns])

   var1                    var2                    var3                       
    x0     x1  x2      x3   x0     x1  x2      x3   x0      x1  x2      x3
0    0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443
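The labels above start at x0, whereas the question's df2 uses x1 to x4. A small variation of the same renaming, assuming the inner level still holds the integers 0 to 3 produced by str.split, shifts the labels by one (final_1based is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

final = (df.stack().str.split(',', expand=True).unstack()
           .swaplevel(axis=1).sort_index(level=0, axis=1))

# Work on a copy so the original integer labels stay available.
final_1based = final.copy()

# Relabel only the inner level: piece 0 -> 'x1', piece 1 -> 'x2', ...
final_1based.columns = pd.MultiIndex.from_tuples(
    [(var, f'x{piece + 1}') for var, piece in final.columns])

# Selecting an outer label returns that variable's pieces as a plain frame.
print(final_1based['var2'])
```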

You can also use the following for the second output shown in your question (df3):

df.stack().str.split(',',expand=True).add_prefix('x').reset_index(1).reset_index(drop=True)

  level_1 x0      x1  x2      x3
0    var1  0    3788  99   20.88
1    var2  0     929  92  299.90
2    var3  8    9332  99   29.10
3    var1  3   99022  08  91.995
4    var2  1   38333   9  993.11
5    var3  7  922111  07  45.443

Here is an approach that gets df3 first:

df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix("x").assign(var=s) for s in df])

print(df3)
  x0      x1  x2      x3   var
0  0    3788  99   20.88  var1
1  3   99022  08  91.995  var1
0  0     929  92  299.90  var2
1  1   38333   9  993.11  var2
0  8    9332  99   29.10  var3
1  7  922111  07  45.443  var3

And then:

df2 = df3.set_index("var", append=True).unstack().swaplevel(axis=1).sort_index(axis=1)

print(df2)
var var1                    var2                    var3                    
      x0     x1  x2      x3   x0     x1  x2      x3   x0      x1  x2      x3
0      0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1      3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443
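The question also asks how to get from df2 back to df3. One possible sketch, assuming df2 was built as above so that its outer column level is named var, is to stack that level back into the rows (back is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix('x').assign(var=s)
                 for s in df])
df2 = (df3.set_index('var', append=True).unstack()
          .swaplevel(axis=1).sort_index(axis=1))

# Move the outer column level ('var') into the row index, turn it into a
# regular column, then group the rows by variable. The stable sort keeps
# the original row order within each variable.
back = (df2.stack(level=0)
           .reset_index(level=1)
           .sort_values('var', kind='stable'))

print(back)
```

Depending on the pandas version, stack may emit a FutureWarning about its new implementation; the result for this data should be the same either way.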

Start by defining a function to reformat a single cell:

def refCell(cell, ind1):
    tbl = cell.split(',')
    ind2 = [ 'x' + str(i) for i in range(1, len(tbl) + 1) ]
    ind = pd.MultiIndex.from_product([[ind1], ind2])
    return pd.Series(tbl, index=ind)

It creates a Series whose values result from splitting the cell, with a MultiIndex where:

  • The first level is ind1.
  • The second level is x1, x2 and so on (a list of strings).

The second function to define reformats a whole row:

def refRow(row):
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    return pd.concat([ refCell(val, idx) for idx, val in row.items() ])

Then, to get the result, apply this function to each row:

df.apply(refRow, axis=1)
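The answer does not show the output of this approach. A self-contained run, written with Series.items() on the assumption of a recent pandas version, can be used to check the resulting shape and labels (result is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

def refCell(cell, ind1):
    # Split one cell and label the pieces (ind1, 'x1'), (ind1, 'x2'), ...
    tbl = cell.split(',')
    ind2 = ['x' + str(i) for i in range(1, len(tbl) + 1)]
    ind = pd.MultiIndex.from_product([[ind1], ind2])
    return pd.Series(tbl, index=ind)

def refRow(row):
    # Concatenate the reformatted cells of one row into a single Series
    return pd.concat([refCell(val, idx) for idx, val in row.items()])

result = df.apply(refRow, axis=1)
print(result)
```

Since refCell runs as a Python call for every cell, this approach is likely to be slower on large frames than the str.split based answers, which matters given the performance concern in the question; timing it on real data is the only way to be sure.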
