Pandas: split a multi-index DataFrame

I have a multi-indexed pandas DataFrame in which I want to split each value on the '||' separator and add one more level to the column index, with three new columns: 'Connection', 'Val1', and 'Val2'.

It would be helpful if someone could give me a few hints on how to do this.

Current example data frame:

                              Experiment1                        Experiment2      
    Target      Analyze1_ab               Analyze2_zz            Analyze1_yy        
    XXX_1   Edge2||3.1E-07||-0.5    Edge2||2.1E-06||-0.9     Edge2||6.4E-02||-0.3   
    XXX_4   Edge1||6.4E-12||1.1     Edge1||2.4E-11||9.4      Edge1||1.4E-11||1.4    
    ABC_1   Edge1||3.9E-07||0.7     Edge1||2.9E-07||5.6      Edge1||6.8E-02||0.4    
    ABC_2   Edge2||1.1E-09||-0.5    Edge2||1.2E-09||1.2      Edge2||1.0E-03||-0.5   
    ABC_3   Edge2||4.6E-25||-0.8    Edge2||2.6E-10||1.9      Edge2||5.0E-17||-0.9   
    XXX_2   Edge2||1.7E-07||-0.5    Edge2||5.7E-08||-0.3     Edge2||4.1E-02||-0.3   
    ABC_4   Edge1||8.1E-02||0.5     Edge1||9.1E-02||1.5      Edge1||5.4E-02||0.6    
    ABC_5   Edge1||6.7E-02||0.3     Edge1||4.2E-02||1.9      Edge1||5.6E-03||0.4    
    XXX_3   Edge2||3.1E-03||-0.4    Edge1||2.4E-11||1.1      Edge2||2.4E-02||-0.3   

Desired data frame:

                                   Experiment1                            Experiment2       
    Target          Analyze1_ab                  Analyze2_zz                Analyze1_yy     
           Connection   Val1    Val2    Connection  Val1    Val2    Connection  Val1    Val2
    XXX_1   Edge2   3.10E-07    -0.5    Edge2   2.10E-06    -0.9    Edge2   6.40E-02    -0.3
    XXX_4   Edge1   6.40E-12    1.1     Edge1   2.40E-11    9.4     Edge1   1.40E-11    1.4
    ABC_1   Edge1   3.90E-07    0.7     Edge1   2.90E-07    5.6     Edge1   6.80E-02    0.4
    ABC_2   Edge2   1.10E-09    -0.5    Edge2   1.20E-09    1.2     Edge2   1.00E-03    -0.5
    ABC_3   Edge2   4.60E-25    -0.8    Edge2   2.60E-10    1.9     Edge2   5.00E-17    -0.9
    XXX_2   Edge2   1.70E-07    -0.5    Edge2   5.70E-08    -0.3    Edge2   4.10E-02    -0.3
    ABC_4   Edge1   8.10E-02    0.5     Edge1   9.10E-02    1.5     Edge1   5.40E-02    0.6
    ABC_5   Edge1   6.70E-02    0.3     Edge1   4.20E-02    1.9     Edge1   5.60E-03    0.4
    XXX_3   Edge2   3.10E-03    -0.4    Edge1   2.40E-11    1.1     Edge2   2.40E-02    -0.3

Setup

df
Out[2319]: 
      ID           Experiment1                                 Experiment2
  Target           Analyze1_ab           Analyze2_zz           Analyze1_yy
0  XXX_1  Edge2||3.1E-07||-0.5  Edge2||2.1E-06||-0.9  Edge2||6.4E-02||-0.3
1  XXX_4   Edge1||6.4E-12||1.1   Edge1||2.4E-11||9.4   Edge1||1.4E-11||1.4
2  ABC_1   Edge1||3.9E-07||0.7   Edge1||2.9E-07||5.6   Edge1||6.8E-02||0.4
3  ABC_2  Edge2||1.1E-09||-0.5   Edge2||1.2E-09||1.2  Edge2||1.0E-03||-0.5
4  ABC_3  Edge2||4.6E-25||-0.8   Edge2||2.6E-10||1.9  Edge2||5.0E-17||-0.9
5  XXX_2  Edge2||1.7E-07||-0.5  Edge2||5.7E-08||-0.3  Edge2||4.1E-02||-0.3
6  ABC_4   Edge1||8.1E-02||0.5   Edge1||9.1E-02||1.5   Edge1||5.4E-02||0.6
7  ABC_5   Edge1||6.7E-02||0.3   Edge1||4.2E-02||1.9   Edge1||5.6E-03||0.4
8  XXX_3  Edge2||3.1E-03||-0.4   Edge1||2.4E-11||1.1  Edge2||2.4E-02||-0.3
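
For anyone who wants to reproduce the setup, the frame above can be rebuilt roughly as follows. This is only a sketch: it assumes the first column carries the two-level label ('ID', 'Target') as printed, and it uses the name Analyze2_zz from the question for the second analysis.

```python
import pandas as pd

# Rows copied from the example data in the question
rows = [
    ("XXX_1", "Edge2||3.1E-07||-0.5", "Edge2||2.1E-06||-0.9", "Edge2||6.4E-02||-0.3"),
    ("XXX_4", "Edge1||6.4E-12||1.1", "Edge1||2.4E-11||9.4", "Edge1||1.4E-11||1.4"),
    ("ABC_1", "Edge1||3.9E-07||0.7", "Edge1||2.9E-07||5.6", "Edge1||6.8E-02||0.4"),
    ("ABC_2", "Edge2||1.1E-09||-0.5", "Edge2||1.2E-09||1.2", "Edge2||1.0E-03||-0.5"),
    ("ABC_3", "Edge2||4.6E-25||-0.8", "Edge2||2.6E-10||1.9", "Edge2||5.0E-17||-0.9"),
    ("XXX_2", "Edge2||1.7E-07||-0.5", "Edge2||5.7E-08||-0.3", "Edge2||4.1E-02||-0.3"),
    ("ABC_4", "Edge1||8.1E-02||0.5", "Edge1||9.1E-02||1.5", "Edge1||5.4E-02||0.6"),
    ("ABC_5", "Edge1||6.7E-02||0.3", "Edge1||4.2E-02||1.9", "Edge1||5.6E-03||0.4"),
    ("XXX_3", "Edge2||3.1E-03||-0.4", "Edge1||2.4E-11||1.1", "Edge2||2.4E-02||-0.3"),
]

# Two-level column labels: (experiment, analysis)
columns = pd.MultiIndex.from_tuples([
    ("ID", "Target"),
    ("Experiment1", "Analyze1_ab"),
    ("Experiment1", "Analyze2_zz"),
    ("Experiment2", "Analyze1_yy"),
])

df = pd.DataFrame(rows, columns=columns)
```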

Solution

import numpy as np
import pandas as pd

#split columns by '||' and rebuild a DataFrame with the separated columns
df2 = pd.DataFrame(np.asarray(df.iloc[:,1:].apply(lambda x: x.str.split(r'\|\|')).values.tolist()).reshape(len(df),-1))
#set Multilevel columns
df2.columns=pd.MultiIndex.from_tuples([('Experiment1','Analyze1_ab','Connection'),
('Experiment1','Analyze1_ab','Val1'),
('Experiment1','Analyze1_ab','Val2'),
('Experiment1','Analyze2_zz','Connection'),
('Experiment1','Analyze2_zz','Val1'),
('Experiment1','Analyze2_zz','Val2'),
('Experiment2','Analyze1_yy','Connection'),
('Experiment2','Analyze1_yy','Val1'),
('Experiment2','Analyze1_yy','Val2')])
#add Target column
df2.insert(0,'Target',df.iloc[:,0])

Out[2324]: 
  Target Experiment1                                           Experiment2  
         Analyze1_ab                Analyze2_zz                Analyze1_yy   
          Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2 
0  XXX_1       Edge2  3.1E-07  -0.5       Edge2  2.1E-06  -0.9       Edge2  6.4E-02  -0.3 
1  XXX_4       Edge1  6.4E-12   1.1       Edge1  2.4E-11   9.4       Edge1  1.4E-11   1.4 
2  ABC_1       Edge1  3.9E-07   0.7       Edge1  2.9E-07   5.6       Edge1  6.8E-02   0.4 
3  ABC_2       Edge2  1.1E-09  -0.5       Edge2  1.2E-09   1.2       Edge2  1.0E-03  -0.5 
4  ABC_3       Edge2  4.6E-25  -0.8       Edge2  2.6E-10   1.9       Edge2  5.0E-17  -0.9 
5  XXX_2       Edge2  1.7E-07  -0.5       Edge2  5.7E-08  -0.3       Edge2  4.1E-02  -0.3 
6  ABC_4       Edge1  8.1E-02   0.5       Edge1  9.1E-02   1.5       Edge1  5.4E-02   0.6 
7  ABC_5       Edge1  6.7E-02   0.3       Edge1  4.2E-02   1.9       Edge1  5.6E-03   0.4 
8  XXX_3       Edge2  3.1E-03  -0.4       Edge1  2.4E-11   1.1       Edge2  2.4E-02  -0.3
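
The hard-coded tuples in the solution above could probably be derived from the existing column labels instead. The following is a minimal sketch on a two-row, two-analysis subset of the data; the names `fields`, `value_cols`, `nested`, `flat`, and `labels` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [("XXX_1", "Edge2||3.1E-07||-0.5", "Edge2||2.1E-06||-0.9"),
     ("XXX_4", "Edge1||6.4E-12||1.1", "Edge1||2.4E-11||9.4")],
    columns=pd.MultiIndex.from_tuples([
        ("ID", "Target"),
        ("Experiment1", "Analyze1_ab"),
        ("Experiment1", "Analyze2_zz"),
    ]),
)

fields = ["Connection", "Val1", "Val2"]   # assumed names for the three parts
value_cols = df.columns[1:]               # every column except Target

# Each cell becomes a 3-item list, so the nested list has the shape
# (rows, columns, 3); reshape flattens the last two axes into one.
nested = df.iloc[:, 1:].apply(lambda s: s.str.split(r"\|\|")).values.tolist()
flat = np.asarray(nested).reshape(len(df), -1)

# Extend every (experiment, analysis) label with each field name
labels = pd.MultiIndex.from_tuples(
    [(exp, ana, field) for exp, ana in value_cols for field in fields]
)

df2 = pd.DataFrame(flat, columns=labels)
```

Because the reshape is row-major, the flattened values line up with `labels` as long as every cell splits into exactly three parts.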

An alternative that splits one column at a time:

import pandas as pd

# Initialize DataFrame
# -----------------------------------------------------------------------------

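# Keys are listed in the same order as the MultiIndex tuples assigned below;
# since Python 3.7 a dict keeps insertion order, so the labels must line up.
df = pd.DataFrame({
    'Analyze1_ab': ['Edge2||3.1E-07||-0.5', 'Edge1||6.4E-12||1.1'],
    'Analyze1_yy': ['Edge2||6.4E-02||-0.3', 'Edge1||1.4E-11||1.4'],
    'Analyze2_zz': ['Edge2||2.1E-06||-0.9', 'Edge1||2.4E-11||9.4'],
    'Target': ['XXX_1', 'XXX_4'],})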

df.columns = pd.MultiIndex.from_tuples(
    [('Experiment1', 'Analyze1_ab'),
     ('Experiment2', 'Analyze1_yy'),
     ('Experiment1', 'Analyze2_zz'),
     ('Target', '')])

# Split 'Analyses' columns by double pipes ||
# -----------------------------------------------------------------------------

# Initialize final DataFrame
final_df = pd.DataFrame()

for col_name in df.columns:
    # regex=False makes '||' a literal; as a regex it matches the empty string
    if (col_name[1].startswith('Analyze') and
            df[col_name].str.contains('||', regex=False).all()):
        # Split 'Analysis' by || into new columns
        splitted_analysis = df[col_name].str.split(r'\|\|', expand=True)
        # The new column names are 0, 1, 2. Let's rename them.
        splitted_analysis.columns = ['Connection', 'Val1', 'Val2']
        # Recreate MultiIndex
        splitted_analysis.columns = pd.MultiIndex.from_tuples(
            [(col_name[0], col_name[1], c) for c in splitted_analysis.columns])
        # Concatenate the new columns to the final_df
        final_df = pd.concat(objs=[final_df, splitted_analysis], axis=1)

# Add 'Target' column in the final_df.
# First, extract it.
target_col = pd.DataFrame(df[('Target', '')])
# Then, increase MultiIndex level of 'Target' from 2 to 3,
# to allow smooth concatenation with the final_df.
target_col.columns = pd.MultiIndex.from_tuples([('Target', '', '')])
final_df = pd.concat([final_df, target_col], axis=1)

Validation: print(final_df)

  Experiment1                Experiment2                Experiment1                Target
  Analyze1_ab                Analyze1_yy                Analyze2_zz                      
   Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2       
0       Edge2  3.1E-07  -0.5       Edge2  6.4E-02  -0.3       Edge2  2.1E-06  -0.9  XXX_1
1       Edge1  6.4E-12   1.1       Edge1  1.4E-11   1.4       Edge1  2.4E-11   9.4  XXX_4

Validation: pprint.pprint([c for c in final_df.columns])

[('Experiment1', 'Analyze1_ab', 'Connection'),
 ('Experiment1', 'Analyze1_ab', 'Val1'),
 ('Experiment1', 'Analyze1_ab', 'Val2'),
 ('Experiment2', 'Analyze1_yy', 'Connection'),
 ('Experiment2', 'Analyze1_yy', 'Val1'),
 ('Experiment2', 'Analyze1_yy', 'Val2'),
 ('Experiment1', 'Analyze2_zz', 'Connection'),
 ('Experiment1', 'Analyze2_zz', 'Val1'),
 ('Experiment1', 'Analyze2_zz', 'Val2'),
 ('Target', '', '')]
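
The loop above can likely be shortened by passing a dict to `pd.concat`: when the dict keys are tuples, pandas expands them into the outer levels of the resulting column MultiIndex. The sketch below assumes a reasonably recent pandas and uses hypothetical names (`fields`, `analysis_cols`, `parts`).

```python
import pandas as pd

df = pd.DataFrame(
    [("XXX_1", "Edge2||3.1E-07||-0.5", "Edge2||6.4E-02||-0.3"),
     ("XXX_4", "Edge1||6.4E-12||1.1", "Edge1||1.4E-11||1.4")],
    columns=pd.MultiIndex.from_tuples([
        ("Target", ""),
        ("Experiment1", "Analyze1_ab"),
        ("Experiment2", "Analyze1_yy"),
    ]),
)

fields = ["Connection", "Val1", "Val2"]
analysis_cols = [c for c in df.columns if c[1].startswith("Analyze")]

# Map each (experiment, analysis) label to its three-column split
parts = {}
for col in analysis_cols:
    split = df[col].str.split(r"\|\|", expand=True)
    split.columns = fields
    parts[col] = split

# The tuple keys become the two outer column levels, `fields` the inner one
final_df = pd.concat(parts, axis=1)
```

The 'Target' column still has to be attached separately, for example as in the answer above.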

You can do it this way:

In [319]: from itertools import product
     ...: x = df.iloc[:, 1:]
     ...: x.columns = list('abc')
     ...: x = x.stack().str.extractall(r'([^\|]+)').unstack([-1, -2])
     ...: x.columns = range(9)
     ...:
     ...: x = df.iloc[:, 0].to_frame('Target').join(x)
     ...:
     ...: x.columns = \
     ...: pd.MultiIndex.from_tuples([('', 'Target', '')]) \
     ...:   .union(pd.MultiIndex.from_tuples(
     ...:             [(t[0][0], t[0][1], t[1])
     ...:              for t in product(df.columns[1:].to_series().tolist(),
     ...:              ['Connection', 'Val1', 'Val2'])
     ...:             ]))
     ...:

Result:

In [320]: x
Out[320]:
         Experiment1                                           Experiment2
  Target Analyze1_ab                Analyze2_zz                Analyze1_yy
          Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2
0  XXX_1       Edge2  3.1E-07  -0.5       Edge2  2.1E-06  -0.9       Edge2  6.4E-02  -0.3
1  XXX_4       Edge1  6.4E-12   1.1       Edge1  2.4E-11   9.4       Edge1  1.4E-11   1.4
2  ABC_1       Edge1  3.9E-07   0.7       Edge1  2.9E-07   5.6       Edge1  6.8E-02   0.4
3  ABC_2       Edge2  1.1E-09  -0.5       Edge2  1.2E-09   1.2       Edge2  1.0E-03  -0.5
4  ABC_3       Edge2  4.6E-25  -0.8       Edge2  2.6E-10   1.9       Edge2  5.0E-17  -0.9
5  XXX_2       Edge2  1.7E-07  -0.5       Edge2  5.7E-08  -0.3       Edge2  4.1E-02  -0.3
6  ABC_4       Edge1  8.1E-02   0.5       Edge1  9.1E-02   1.5       Edge1  5.4E-02   0.6
7  ABC_5       Edge1  6.7E-02   0.3       Edge1  4.2E-02   1.9       Edge1  5.6E-03   0.4
8  XXX_3       Edge2  3.1E-03  -0.4       Edge1  2.4E-11   1.1       Edge2  2.4E-02  -0.3
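
Note that after splitting, Val1 and Val2 are still strings. If numeric values are needed, they could be converted afterwards, for instance as sketched below on a small stand-in frame (the names `numeric_mask` and `numeric_cols` are illustrative).

```python
import pandas as pd

# Stand-in for the split result: three-level columns, all values as strings
x = pd.DataFrame(
    [["Edge2", "3.1E-07", "-0.5"], ["Edge1", "6.4E-12", "1.1"]],
    columns=pd.MultiIndex.from_product(
        [["Experiment1"], ["Analyze1_ab"], ["Connection", "Val1", "Val2"]]
    ),
)

# True for columns whose innermost level is Val1 or Val2
numeric_mask = x.columns.get_level_values(-1).isin(["Val1", "Val2"])
numeric_cols = x.columns[numeric_mask]

# Convert one column at a time; Connection is left untouched
for col in numeric_cols:
    x[col] = pd.to_numeric(x[col])
```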
