
How to split multiple columns in Pandas

I have a data frame like below:

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})
Out[248]: 
                var1              var2                var3
0    0,3788,99,20.88   0,929,92,299.90     8,9332,99,29.10
1  3,99022,08,91.995  1,38333,9,993.11  7,922111,07,45.443

I want to split each column on the comma and place the resulting sets of columns next to each other. The resulting data frame should look like this:

df2 = pd.DataFrame({('var1', 'x1'): [0, 3], ('var1', 'x2'): [3788, 99022], ('var1', 'x3'): [99, '08'], ('var1', 'x4'): [20.88, 91.995],
                    ('var2', 'x1'): [0, 1], ('var2', 'x2'): [929, 38333], ('var2', 'x3'): [92, 9], ('var2', 'x4'): [299.90, 993.11],
                    ('var3', 'x1'): [8, 7], ('var3', 'x2'): [9332, 922111], ('var3', 'x3'): [99, '07'], ('var3', 'x4'): [29.10, 45.443]})

Out[249]: 
  var1                    var2                    var3                    
    x1     x2  x3      x4   x1     x2  x3      x4   x1      x2  x3      x4
0    0   3788  99  20.880    0    929  92  299.90    8    9332  99  29.100
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443

The MultiIndex is not mandatory, but I would like to be able to gather the data easily into a long format, df3, if needed:

    var  x1      x2  x3       x4
0  var1   0    3788  99   20.880
1  var1   3   99022  08   91.995
0  var2   0     929  92  299.900
1  var2   1   38333   9  993.110
0  var3   8    9332  99   29.100
1  var3   7  922111  07   45.443

My attempt used pd.melt and str.split:

df_long = pd.melt(df.reset_index(drop = False), id_vars = 'index', var_name = 'var', value_name = 'values') \
    .sort_values(['index', 'var']) \
    .set_index('index')
df_long = df_long['values'].str.split(',', expand = True)
df_long.columns = ['x' + str(i) for i in range(df_long.shape[1])]

But: 1) I don't know how to then spread the data for var1, var2, var3, ... next to each other; 2) transforming from wide format to long format (df to df_long) and back again (df_long to df3) seems highly inefficient, and performance matters for the solution I am looking for.

So what is the best way to transform df into df2, so that df3 can then be obtained easily if needed?

You can combine stack(), str.split() with expand=True, and unstack() to achieve this:

final = (df.stack().str.split(',', expand=True).unstack()
           .swaplevel(axis=1).sort_index(level=0, axis=1))
print(final)

     var1                    var2                    var3                    
     0      1   2       3    0      1   2       3    0       1   2       3
0    0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443
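To see what each link in that chain contributes, here is a minimal step-by-step sketch on the question's data. The intermediate names (stacked, pieces, wide) are introduced here for illustration only and are not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

# 1) stack: one string per (row, var) pair, as a Series with a 2-level index
stacked = df.stack()

# 2) split: one column per comma-separated piece (columns labelled 0..3)
pieces = stacked.str.split(',', expand=True)

# 3) unstack: move the var level from the index into the columns,
#    which gives (piece, var) column pairs
wide = pieces.unstack()

# 4) put var on top, then sort so that each var's pieces sit together
final = wide.swaplevel(axis=1).sort_index(level=0, axis=1)
print(final)
```

Note that every value is still a string at this point; the split does not convert anything to numbers.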

To rename the second (inner) level of the columns, use:

final.columns=pd.MultiIndex.from_tuples([(a,f'x{b}') for a,b in final.columns])

   var1                    var2                    var3                       
    x0     x1  x2      x3   x0     x1  x2      x3   x0      x1  x2      x3
0    0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443
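The question labels the pieces x1 to x4, whereas the rename above yields x0 to x3. A small sketch of a one-based relabelling, assuming final is built as in this answer:

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

final = (df.stack().str.split(',', expand=True).unstack()
           .swaplevel(axis=1).sort_index(level=0, axis=1))

# The inner level holds the integers 0..3; shift by one to get x1..x4
final.columns = pd.MultiIndex.from_tuples(
    [(var, f'x{i + 1}') for var, i in final.columns])
print(final)
```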

For the long format (df3) shown in your question, you can also use:

df.stack().str.split(',',expand=True).add_prefix('x').reset_index(1).reset_index(drop=True)

  level_1 x0      x1  x2      x3
0    var1  0    3788  99   20.88
1    var2  0     929  92  299.90
2    var3  8    9332  99   29.10
3    var1  3   99022  08  91.995
4    var2  1   38333   9  993.11
5    var3  7  922111  07  45.443
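The output above is ordered by row and names the variable column level_1. The following sketch names the index levels before resetting and then sorts by variable, so the result matches the df3 layout in the question. The names row, var and long_df are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

long_df = (df.stack().str.split(',', expand=True).add_prefix('x')
             .rename_axis(index=['row', 'var'])   # name the two index levels
             .reset_index('var')                  # var becomes a column, row stays as index
             .sort_values('var', kind='stable'))  # group by var, keep row order within each var
print(long_df)
```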

Here is an approach that gets df3 first:

df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix("x").assign(var=s) for s in df])

print(df3)
  x0      x1  x2      x3   var
0  0    3788  99   20.88  var1
1  3   99022  08  91.995  var1
0  0     929  92  299.90  var2
1  1   38333   9  993.11  var2
0  8    9332  99   29.10  var3
1  7  922111  07  45.443  var3

And then:

df2 = df3.set_index("var", append=True).unstack().swaplevel(axis=1).sort_index(axis=1)

print(df2)
var var1                    var2                    var3                    
      x0     x1  x2      x3   x0     x1  x2      x3   x0      x1  x2      x3
0      0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1      3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443
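Since the question also asks for an easy way back from df2 to df3, here is one possible sketch: stacking the outer column level returns the data to long form. The name back is hypothetical, and the explicit sort is an assumption made so that the row order does not depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

# Build df3 and df2 as in this answer
df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix('x').assign(var=s)
                 for s in df])
df2 = (df3.set_index('var', append=True).unstack()
          .swaplevel(axis=1).sort_index(axis=1))

# Move the outer column level (named 'var') back into the index,
# order by (var, row), then turn var into an ordinary column again
back = (df2.stack(level=0)
           .swaplevel()
           .sort_index()
           .reset_index(level='var'))
print(back)
```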

Start by defining a function to reformat a single cell:

def refCell(cell, ind1):
    tbl = cell.split(',')
    ind2 = [ 'x' + str(i) for i in range(1, len(tbl) + 1) ]
    ind = pd.MultiIndex.from_product([[ind1], ind2])
    return pd.Series(tbl, index=ind)

It creates a Series with values resulting from splitting a cell, with a MultiIndex, where:

  • The first level is ind1 .
  • The second level is x1 , x2 and so on (a list of strings).

Next, define a function to reformat a row:

def refRow(row):
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    return pd.concat([ refCell(val, idx) for idx, val in row.items() ])

Then, to get the result, apply this function (to each row):

df.apply(refRow, axis=1)
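A self-contained run of the two functions on the question's data might look like the sketch below. It uses Series.items(), since iteritems() was removed in pandas 2.0; the name result is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

def refCell(cell, ind1):
    # Split one cell and label the pieces (ind1, 'x1'), (ind1, 'x2'), ...
    tbl = cell.split(',')
    ind2 = ['x' + str(i) for i in range(1, len(tbl) + 1)]
    ind = pd.MultiIndex.from_product([[ind1], ind2])
    return pd.Series(tbl, index=ind)

def refRow(row):
    # Reformat every cell of the row and glue the pieces side by side
    return pd.concat([refCell(val, idx) for idx, val in row.items()])

# Each row yields a Series with a MultiIndex, which becomes the column index
result = df.apply(refRow, axis=1)
print(result)
```

This version produces the x1 to x4 labels directly, but it calls a Python function once per row and once per cell, so it is worth timing against the str.split-based answers if performance matters.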
