[英]Pandas- split a multiple index dataframe
I have a multiple indexed pandas data frame where I want to separate values by '||' 我有一个多索引的pandas数据框,我想用“||”分隔值 character and include one more layer of index with three new columns 'Connection', 'Val1' 'Val2'.
字符并包含一个索引层,其中包含三个新列'Connection','Val1''Val2'。
Will be helpful if someone could give me few hints to do this. 如果有人可以给我一些提示来做这件事会有帮助。
Current example Data Frame: 当前示例数据框:
Experiment1 Experiment2
Target Analyze1_ab Analyze2_zz Analyze1_yy
XXX_1 Edge2||3.1E-07||-0.5 Edge2||2.1E-06||-0.9 Edge2||6.4E-02||-0.3
XXX_4 Edge1||6.4E-12||1.1 Edge1||2.4E-11||9.4 Edge1||1.4E-11||1.4
ABC_1 Edge1||3.9E-07||0.7 Edge1||2.9E-07||5.6 Edge1||6.8E-02||0.4
ABC_2 Edge2||1.1E-09||-0.5 Edge2||1.2E-09||1.2 Edge2||1.0E-03||-0.5
ABC_3 Edge2||4.6E-25||-0.8 Edge2||2.6E-10||1.9 Edge2||5.0E-17||-0.9
XXX_2 Edge2||1.7E-07||-0.5 Edge2||5.7E-08||-0.3 Edge2||4.1E-02||-0.3
ABC_4 Edge1||8.1E-02||0.5 Edge1||9.1E-02||1.5 Edge1||5.4E-02||0.6
ABC_5 Edge1||6.7E-02||0.3 Edge1||4.2E-02||1.9 Edge1||5.6E-03||0.4
XXX_3 Edge2||3.1E-03||-0.4 Edge1||2.4E-11||1.1 Edge2||2.4E-02||-0.3
Desired Data Frame: 期望的数据框架:
Experiment1 Experiment2
Target Analyze1_ab Analyze2_zz Analyze1_yy
Connection Val1 Val2 Connection Val1 Val2 Connection Val1 Val2
XXX_1 Edge2 3.10E-07 -0.5 Edge2 2.10E-06 -0.9 Edge2 6.40E-02 -0.3
XXX_4 Edge1 6.40E-12 1.1 Edge1 2.40E-11 9.4 Edge1 1.40E-11 1.4
ABC_1 Edge1 3.90E-07 0.7 Edge1 2.90E-07 5.6 Edge1 6.80E-02 0.4
ABC_2 Edge2 1.10E-09 -0.5 Edge2 1.20E-09 1.2 Edge2 1.00E-03 -0.5
ABC_3 Edge2 4.60E-25 -0.8 Edge2 2.60E-10 1.9 Edge2 5.00E-17 -0.9
XXX_2 Edge2 1.70E-07 -0.5 Edge2 5.70E-08 -0.3 Edge2 4.10E-02 -0.3
ABC_4 Edge1 8.10E-02 0.5 Edge1 9.10E-02 1.5 Edge1 5.40E-02 0.6
ABC_5 Edge1 6.70E-02 0.3 Edge1 4.20E-02 1.9 Edge1 5.60E-03 0.4
XXX_3 Edge2 3.10E-03 -0.4 Edge1 2.40E-11 1.1 Edge2 2.40E-02 -0.3
Setup 建立
df
Out[2319]:
ID Experiment1 Experiment2
Target Analyze1_ab Analyze2_ab Analyze1_yy
0 XXX_1 Edge2||3.1E-07||-0.5 Edge2||2.1E-06||-0.9 Edge2||6.4E-02||-0.3
1 XXX_4 Edge1||6.4E-12||1.1 Edge1||2.4E-11||9.4 Edge1||1.4E-11||1.4
2 ABC_1 Edge1||3.9E-07||0.7 Edge1||2.9E-07||5.6 Edge1||6.8E-02||0.4
3 ABC_2 Edge2||1.1E-09||-0.5 Edge2||1.2E-09||1.2 Edge2||1.0E-03||-0.5
4 ABC_3 Edge2||4.6E-25||-0.8 Edge2||2.6E-10||1.9 Edge2||5.0E-17||-0.9
5 XXX_2 Edge2||1.7E-07||-0.5 Edge2||5.7E-08||-0.3 Edge2||4.1E-02||-0.3
6 ABC_4 Edge1||8.1E-02||0.5 Edge1||9.1E-02||1.5 Edge1||5.4E-02||0.6
7 ABC_5 Edge1||6.7E-02||0.3 Edge1||4.2E-02||1.9 Edge1||5.6E-03||0.4
8 XXX_3 Edge2||3.1E-03||-0.4 Edge1||2.4E-11||1.1 Edge2||2.4E-02||-0.3
Solution 解
#split columns by '||' and rebuild a Dataframe with the separated columns
df2 = pd.DataFrame(np.asarray(df.iloc[:,1:].apply(lambda x: x.str.split('\|\|')).values.tolist()).reshape(9,-1))
#set Multilevel columns
df2.columns=pd.MultiIndex.from_tuples([('Experiment1','Analyze1_ab','Connection'),
('Experiment1','Analyze1_ab','Val1'),
('Experiment1','Analyze1_ab','Val2'),
('Experiment1','Analyze2_zz','Connection'),
('Experiment1','Analyze2_zz','Val1'),
('Experiment1','Analyze2_zz','Val2'),
('Experiment2','Analyze1_yy','Connection'),
('Experiment2','Analyze1_yy','Val1'),
('Experiment2','Analyze1_yy','Val2')])
#add Target column
df2.insert(0,'Target',df.iloc[:,0])
Out[2324]:
Target Experiment1 Experiment2
Analyze1_ab Analyze2_zz Analyze1_yy
Connection Val1 Val2 Connection Val1 Val2 Connection Val1 Val2
0 XXX_1 Edge2 3.1E-07 -0.5 Edge2 2.1E-06 -0.9 Edge2 6.4E-02 -0.3
1 XXX_4 Edge1 6.4E-12 1.1 Edge1 2.4E-11 9.4 Edge1 1.4E-11 1.4
2 ABC_1 Edge1 3.9E-07 0.7 Edge1 2.9E-07 5.6 Edge1 6.8E-02 0.4
3 ABC_2 Edge2 1.1E-09 -0.5 Edge2 1.2E-09 1.2 Edge2 1.0E-03 -0.5
4 ABC_3 Edge2 4.6E-25 -0.8 Edge2 2.6E-10 1.9 Edge2 5.0E-17 -0.9
5 XXX_2 Edge2 1.7E-07 -0.5 Edge2 5.7E-08 -0.3 Edge2 4.1E-02 -0.3
6 ABC_4 Edge1 8.1E-02 0.5 Edge1 9.1E-02 1.5 Edge1 5.4E-02 0.6
7 ABC_5 Edge1 6.7E-02 0.3 Edge1 4.2E-02 1.9 Edge1 5.6E-03 0.4
8 XXX_3 Edge2 3.1E-03 -0.4 Edge1 2.4E-11 1.1 Edge2 2.4E-02 -0.3
import pandas as pd
# Initialize DataFrame
# -----------------------------------------------------------------------------
df = pd.DataFrame({
'Analyze1_ab': ['Edge2||3.1E-07||-0.5', 'Edge1||6.4E-12||1.1'],
'Analyze2_zz': ['Edge2||2.1E-06||-0.9', 'Edge1||2.4E-11||9.4'],
'Analyze1_yy': ['Edge2||6.4E-02||-0.3', 'Edge1||1.4E-11||1.4'],
'Target': ['XXX_1', 'XXX_4'],})
df.columns = pd.MultiIndex.from_tuples(
[('Experiment1', 'Analyze1_ab'),
('Experiment2', 'Analyze1_yy'),
('Experiment1', 'Analyze2_zz'),
('Target', '')])
# Split 'Analyses' columns by double pipes ||
# -----------------------------------------------------------------------------
# Initialize final DataFrame
final_df = pd.DataFrame()
for col_name in df.columns:
if (col_name[1].startswith('Analyze') and
df[col_name].str.contains('||').all()):
# Split 'Analysis' by || into new columns
splitted_analysis = df[col_name].str.split('\|\|', expand=True)
# The new column names are 0, 1, 2. Let's rename them.
splitted_analysis.columns = ['Connection', 'Val1', 'Val2']
# Recreate MultiIndex
splitted_analysis.columns = pd.MultiIndex.from_tuples(
[(col_name[0], col_name[1], c) for c in splitted_analysis.columns])
# Concatenate the new columns to the final_df
final_df = pd.concat(objs=[final_df, splitted_analysis], axis=1)
# Add 'Target' column in the final_df.
# First, extract it.
target_col = pd.DataFrame(df[('Target', '')])
# Then, increase MultiIndex level of 'Target' from 2 to 3,
# to allow smooth concatenation with the final_df.
target_col.columns = pd.MultiIndex.from_tuples([('Target', '', '')])
final_df = pd.concat([final_df, target_col], axis=1)
Validation: print(final_df)
: 验证:
print(final_df)
:
Experiment1 Experiment2 Experiment1 Target
Analyze1_ab Analyze1_yy Analyze2_zz
Connection Val1 Val2 Connection Val1 Val2 Connection Val1 Val2
0 Edge2 3.1E-07 -0.5 Edge2 6.4E-02 -0.3 Edge2 2.1E-06 -0.9 XXX_1
1 Edge1 6.4E-12 1.1 Edge1 1.4E-11 1.4 Edge1 2.4E-11 9.4 XXX_4
Validation: pprint.pprint([c for c in final_df.columns])
: 验证:
pprint.pprint([c for c in final_df.columns])
:
[('Experiment1', 'Analyze1_ab', 'Connection'),
('Experiment1', 'Analyze1_ab', 'Val1'),
('Experiment1', 'Analyze1_ab', 'Val2'),
('Experiment2', 'Analyze1_yy', 'Connection'),
('Experiment2', 'Analyze1_yy', 'Val1'),
('Experiment2', 'Analyze1_yy', 'Val2'),
('Experiment1', 'Analyze2_zz', 'Connection'),
('Experiment1', 'Analyze2_zz', 'Val1'),
('Experiment1', 'Analyze2_zz', 'Val2'),
('Target', '', '')]
You can do it this way: 你可以这样做:
In [319]: x = df.iloc[:, 1:]
...: x.columns = list('abc')
...: x = x.stack().str.extractall('([^\|]+)').unstack([-1, -2])
...: x.columns = range(9)
...:
...: x = df.iloc[:, 0].to_frame('Target').join(x)
...:
...: x.columns = \
...: pd.MultiIndex.from_tuples([('', 'Target', '')]) \
...: .union(pd.MultiIndex.from_tuples(
...: [(t[0][0], t[0][1], t[1])
...: for t in product(df.columns[1:].to_series().tolist(),
...: ['Connection', 'Val1', 'Val2'])
...: ]))
...:
Result: 结果:
In [320]: x
Out[320]:
Experiment1 Experiment2
Target Analyze1_ab Analyze2_zz Analyze1_yy
Connection Val1 Val2 Connection Val1 Val2 Connection Val1 Val2
0 XXX_1 Edge2 3.1E-07 -0.5 Edge2 2.1E-06 -0.9 Edge2 6.4E-02 -0.3
1 XXX_4 Edge1 6.4E-12 1.1 Edge1 2.4E-11 9.4 Edge1 1.4E-11 1.4
2 ABC_1 Edge1 3.9E-07 0.7 Edge1 2.9E-07 5.6 Edge1 6.8E-02 0.4
3 ABC_2 Edge2 1.1E-09 -0.5 Edge2 1.2E-09 1.2 Edge2 1.0E-03 -0.5
4 ABC_3 Edge2 4.6E-25 -0.8 Edge2 2.6E-10 1.9 Edge2 5.0E-17 -0.9
5 XXX_2 Edge2 1.7E-07 -0.5 Edge2 5.7E-08 -0.3 Edge2 4.1E-02 -0.3
6 ABC_4 Edge1 8.1E-02 0.5 Edge1 9.1E-02 1.5 Edge1 5.4E-02 0.6
7 ABC_5 Edge1 6.7E-02 0.3 Edge1 4.2E-02 1.9 Edge1 5.6E-03 0.4
8 XXX_3 Edge2 3.1E-03 -0.4 Edge1 2.4E-11 1.1 Edge2 2.4E-02 -0.3
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.