How to split multiple columns in Pandas
I have a data frame like the one below:
df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})
Out[248]:
var1 var2 var3
0 0,3788,99,20.88 0,929,92,299.90 8,9332,99,29.10
1 3,99022,08,91.995 1,38333,9,993.11 7,922111,07,45.443
I want to split each column on commas and place the new sets of columns next to each other. So the resulting data frame should look like this:
df2 = pd.DataFrame({('var1', 'x1'): [0, 3], ('var1', 'x2'): [3788, 99022], ('var1', 'x3'): [99, '08'], ('var1', 'x4'): [20.88, 91.995],
('var2', 'x1'): [0, 1], ('var2', 'x2'): [929, 38333], ('var2', 'x3'): [92, 9], ('var2', 'x4'): [299.90, 993.11],
('var3', 'x1'): [8, 7], ('var3', 'x2'): [9332, 922111], ('var3', 'x3'): [99, '07'], ('var3', 'x4'): [29.10, 45.443]})
Out[249]:
var1 var2 var3
x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4
0 0 3788 99 20.880 0 929 92 299.90 8 9332 99 29.100
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
The MultiIndex is not mandatory, but I'd like to be able to easily gather the data and obtain df3 if needed:
var x1 x2 x3 x4
0 var1 0 3788 99 20.880
1 var1 3 99022 08 91.995
0 var2 0 929 92 299.900
1 var2 1 38333 9 993.110
0 var3 8 9332 99 29.100
1 var3 7 922111 07 45.443
My attempt used pd.melt and str.split:
df_long = pd.melt(df.reset_index(drop = False), id_vars = 'index', var_name = 'var', value_name = 'values') \
.sort_values(['index', 'var']) \
.set_index('index')
df_long = df_long['values'].str.split(',', expand = True)
df_long.columns = ['x' + str(i) for i in range(df_long.shape[1])]
But: 1) I don't know how to then spread the data for the different var1, var2, var3... next to each other; 2) transforming from wide format to long format (df to df_long) and back again (df_long to df3) seems highly inefficient, and I care about the performance of the solution.
So what's the best way to transform df into df2, so that we can then easily obtain df3 if needed?
You can use stack, str.split() with expand=True, and unstack() to achieve this:
final=(df.stack().str.split(',',expand=True).unstack().swaplevel(axis=1)
.sort_index(level=0,axis=1))
print(final)
var1 var2 var3
0 1 2 3 0 1 2 3 0 1 2 3
0 0 3788 99 20.88 0 929 92 299.90 8 9332 99 29.10
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
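To see what each step of the chain contributes, the one-liner can be unrolled into named intermediates. This is only a sketch of the same pipeline; the names stacked, parts, and wide are illustrative and do not appear in the answer above.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

# 1) Wide -> long: one string per (row, column) pair, 6 values in total.
stacked = df.stack()

# 2) Split every string on commas; expand=True yields one column per piece (0..3).
parts = stacked.str.split(',', expand=True)

# 3) Move the inner index level (the original column name) into the columns.
#    Columns are now a MultiIndex of (piece number, original column name).
wide = parts.unstack()

# 4) Put the original column name on top and sort so each var's pieces sit together.
final = wide.swaplevel(axis=1).sort_index(level=0, axis=1)

print(stacked.shape, parts.shape, wide.shape, final.shape)
```

All values in final are still strings at this point, because str.split does not convert types.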
To rename the second level of the columns (the split positions), use:
final.columns=pd.MultiIndex.from_tuples([(a,f'x{b}') for a,b in final.columns])
var1 var2 var3
x0 x1 x2 x3 x0 x1 x2 x3 x0 x1 x2 x3
0 0 3788 99 20.88 0 929 92 299.90 8 9332 99 29.10
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
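The question labels the pieces x1 to x4, whereas the renaming above produces x0 to x3. If the one-based labels matter, one possible variation is to rename only the second column level with DataFrame.rename and its level argument. This is a sketch under that assumption; the name renamed is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

final = (df.stack().str.split(',', expand=True).unstack().swaplevel(axis=1)
           .sort_index(level=0, axis=1))

# Apply the function only to level 1 (the integer piece numbers 0..3),
# shifting them by one so the labels become x1..x4 as in the question.
renamed = final.rename(columns=lambda i: f'x{i + 1}', level=1)

print(renamed)
```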
You can also use the following for the second output shown in your question:
df.stack().str.split(',',expand=True).add_prefix('x').reset_index(1).reset_index(drop=True)
level_1 x0 x1 x2 x3
0 var1 0 3788 99 20.88
1 var2 0 929 92 299.90
2 var3 8 9332 99 29.10
3 var1 3 99022 08 91.995
4 var2 1 38333 9 993.11
5 var3 7 922111 07 45.443
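The question also asks how to gather the wide result back into the long form. A minimal sketch of one way to do this, assuming the wide frame has the original column names on the top column level: select each top-level block and concatenate the blocks. The name top is illustrative. Unlike the output above, this keeps the rows grouped by var, as in the df3 shown in the question.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

final = (df.stack().str.split(',', expand=True).unstack().swaplevel(axis=1)
           .sort_index(level=0, axis=1))
final.columns = pd.MultiIndex.from_tuples([(a, f'x{b}') for a, b in final.columns])

# Unique top-level labels, in column order: var1, var2, var3.
top = final.columns.get_level_values(0).unique()

# final[v] selects one block and drops the top level, leaving columns x0..x3.
df3 = pd.concat([final[v].assign(var=v) for v in top])

# Move 'var' to the front to match the layout in the question.
df3 = df3[['var'] + [c for c in df3.columns if c != 'var']]

print(df3)
```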
Here is an approach that gets df3 first:
df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix("x").assign(var=s) for s in df])
print(df3)
x0 x1 x2 x3 var
0 0 3788 99 20.88 var1
1 3 99022 08 91.995 var1
0 0 929 92 299.90 var2
1 1 38333 9 993.11 var2
0 8 9332 99 29.10 var3
1 7 922111 07 45.443 var3
And then:
df2 = df3.set_index("var", append=True).unstack().swaplevel(axis=1).sort_index(axis=1)
print(df2)
var var1 var2 var3
x0 x1 x2 x3 x0 x1 x2 x3 x0 x1 x2 x3
0 0 3788 99 20.88 0 929 92 299.90 8 9332 99 29.10
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
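Note that str.split leaves every piece as a string, while the desired output in the question shows numbers in most columns and keeps values such as '08' as text. A hedged sketch of one way to handle this is to convert only selected columns with astype. Which columns should become numeric is an assumption made here for illustration; the source does not say.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix('x').assign(var=s)
                 for s in df])

# Assumption: x0 and x1 are integers, x3 is a float, and x2 stays text
# so that leading zeros such as '08' are preserved.
df3 = df3.astype({'x0': 'int64', 'x1': 'int64', 'x3': 'float64'})

print(df3.dtypes)
```

Converting x2 as well would turn '08' into 8, which may or may not be what is wanted.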
Start by defining a function to reformat a single cell:
def refCell(cell, ind1):
    tbl = cell.split(',')
    ind2 = [ 'x' + str(i) for i in range(1, len(tbl) + 1) ]
    ind = pd.MultiIndex.from_product([[ind1], ind2])
    return pd.Series(tbl, index=ind)
It creates a Series holding the values obtained by splitting a cell, with a MultiIndex whose first level is ind1 (the name of the source column) and whose second level is x1, x2, and so on.
The second function to define reformats a row:
def refRow(row):
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    return pd.concat([ refCell(val, idx) for idx, val in row.items() ])
Then, to get the result, apply this function to each row:
df.apply(refRow, axis=1)
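Put together, the row-wise approach can be run as the following self-contained sketch. It uses Series.items() so that it also works on pandas 2.x; the name result is illustrative. Because it builds a small Series for every cell, it is likely to be slower on large frames than the vectorized str.split variants, which may matter given the performance concern in the question.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

def refCell(cell, ind1):
    # Split one cell and label the pieces (ind1, 'x1'), (ind1, 'x2'), ...
    tbl = cell.split(',')
    ind2 = ['x' + str(i) for i in range(1, len(tbl) + 1)]
    ind = pd.MultiIndex.from_product([[ind1], ind2])
    return pd.Series(tbl, index=ind)

def refRow(row):
    # Reformat every cell of the row and glue the pieces side by side.
    return pd.concat([refCell(val, idx) for idx, val in row.items()])

result = df.apply(refRow, axis=1)
print(result)
```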