
Formatting a data frame: reducing columns and increasing rows

I have a pandas data frame like this:

import pandas as pd

data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]

df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
df

             01_PLAGL1  02_PLAGL1   H19   H19   H19  GNAS_A/B
probes                                                  
NGS_34       0.47       0.55  0.51  0.53  0.54      0.62
NGS_38       0.52       0.52  0.49  0.51  0.52      0.45

This is a minimal reproducible example. The real data frame consists of many paired columns (like 01_PLAGL1 and 02_PLAGL1 in the example), 2 sets of three columns (like H19, H19, H19 in the example) and 2 unique columns. With this explanation and the columns of my real dataset below, I think the input data of my problem should be clear.

data_no_control.columns.values

array(['PLAGL1', 'PLAGL1', 'GRB10', 'GRB10', 'MEST', 'MEST', 'H19', 'H19',
       'H19', 'KCNQ1OT1', 'KCNQ1OT1', 'MEG3', 'MEG3', 'MEG8', 'MEG8',
       'SNRPN', 'SNRPN', 'PEG3', 'PEG3', 'PEG3', 'NESP55', 'GNAS-AS1',
       'GNASXL', 'GNASXL', 'GNAS_A/B'], dtype=object)

The final output I would like to achieve should look like this:

            01_PLAGL1     H19      GNAS A/B
probes                                                  
NGS_34       0.47         0.51      0.62
             0.55         0.53
                          0.54
(One empty row)
(Second empty row)
NGS_38       0.52         0.49      0.45
             0.52         0.51
                          0.52
(One empty row)
(Second empty row)
NGS_41 ...

I have tried this:

df = data_no_control.reset_index(level=0)


empty_rows = 5
df.index = range(0, empty_rows*len(df), empty_rows)
new_df = df.reindex(index=range(empty_rows*len(df)))

new_df = new_df.set_index('index')

new_df

index        01_PLAGL1  02_PLAGL1   H19   H19   H19  GNAS_A/B
                                                  
NGS_34       0.47       0.55  0.51  0.53  0.54      0.62
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NGS_38       0.52       0.52  0.49  0.51  0.52      0.45
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN
NaN           NaN        NaN   NaN   NaN   NaN       NaN

Use:

import numpy as np
import pandas as pd

data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]

df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')

# number of empty rows to add per probe
new = 2
# remove everything before the last '_' in the column names
# (this also turns 'GNAS_A/B' into 'A/B')
s = df.columns.str.split('_').str[-1].to_series()
# create a MultiIndex from the names and a per-name counter
df.columns = [s, s.groupby(s).cumcount()]
# reshape: move the counter level from the columns into the rows
df = df.stack()
# create a MultiIndex that adds the new rows, and keep the original column order
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())
print(df)
          PLAGL1   H19   A/B
probes                      
NGS_34 0    0.47  0.51  0.62
       1    0.55  0.53   NaN
       2     NaN  0.54   NaN
       3     NaN   NaN   NaN
       4     NaN   NaN   NaN
NGS_38 0    0.52  0.49  0.45
       1    0.52  0.51   NaN
       2     NaN  0.52   NaN
       3     NaN   NaN   NaN
       4     NaN   NaN   NaN
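
To see what the counter level contains, the cumcount step can be run on its own. This is only a small sketch: the variable names `labels` and `counter` are made up for illustration, and a plain Series with a default index stands in for the column names after the prefix has been removed.

```python
import pandas as pd

# Column names as they look once the prefix before '_' is gone
labels = pd.Series(['PLAGL1', 'PLAGL1', 'H19', 'H19', 'H19', 'A/B'])

# Number each occurrence of a name: 0 for the first, 1 for the second, ...
counter = labels.groupby(labels).cumcount()

print(counter.tolist())
# [0, 1, 0, 1, 2, 0]
```

After stacking, this counter becomes the second level of the row index, which is why each probe gets one row per repeat of the most frequent column name.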

Finally, if you need empty strings instead of missing values and no counter level, use:

df = df.droplevel(1).fillna('')
df.index = df.index.where(~df.index.duplicated(), '')
print (df)
       PLAGL1   H19   A/B
probes                   
NGS_34   0.47  0.51  0.62
         0.55  0.53      
               0.54      
                         
                         
NGS_38   0.52  0.49  0.45
         0.52  0.51      
               0.52      
                         

EDIT: The real data sample below has no duplicate column names, so the output is different:

d = {'PLAGL1': {'NGS_34': 0.55, 'NGS_38': 0.52}, 'GRB10': {'NGS_34': 0.48, 'NGS_38': 0.49}, 'MEST': {'NGS_34': 0.56, 'NGS_38': 0.5}, 'H19': {'NGS_34': 0.54, 'NGS_38': 0.52}, 'KCNQ1OT1': {'NGS_34': 0.41, 'NGS_38': 0.49}, 'MEG3': {'NGS_34': 0.5, 'NGS_38': 0.55}, 'MEG8': {'NGS_34': 0.46, 'NGS_38': 0.5}, 'SNRPN': {'NGS_34': 0.55, 'NGS_38': 0.46}, 'PEG3': {'NGS_34': 0.51, 'NGS_38': 0.51}, 'NESP55': {'NGS_34': 0.55, 'NGS_38': 0.53}, 'GNAS-AS1': {'NGS_34': 0.52, 'NGS_38': 0.48}, 'GNASXL': {'NGS_34': 0.49, 'NGS_38': 0.44}, 'GNAS A/B': {'NGS_34': 0.62, 'NGS_38': 0.45}}

df = pd.DataFrame(d)
print (df)

        PLAGL1  GRB10  MEST   H19  KCNQ1OT1  MEG3  MEG8  SNRPN  PEG3  NESP55  \
NGS_34    0.55   0.48  0.56  0.54      0.41  0.50  0.46   0.55  0.51    0.55   
NGS_38    0.52   0.49  0.50  0.52      0.49  0.55  0.50   0.46  0.51    0.53   

        GNAS-AS1  GNASXL  GNAS A/B  
NGS_34      0.52    0.49      0.62  
NGS_38      0.48    0.44      0.45  

# number of empty rows to add per probe
new = 2
# remove everything before the last '_' in the column names
s = df.columns.str.split('_').str[-1].to_series()
# create a MultiIndex from the names and a per-name counter
df.columns = [s, s.groupby(s).cumcount()]
# reshape: move the counter level from the columns into the rows
df = df.stack()
# create a MultiIndex that adds the new rows, and keep the original column order
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())

print(df)
          PLAGL1  GRB10  MEST   H19  KCNQ1OT1  MEG3  MEG8  SNRPN  PEG3  \
NGS_34 0    0.55   0.48  0.56  0.54      0.41  0.50  0.46   0.55  0.51   
       1     NaN    NaN   NaN   NaN       NaN   NaN   NaN    NaN   NaN   
       2     NaN    NaN   NaN   NaN       NaN   NaN   NaN    NaN   NaN   
NGS_38 0    0.52   0.49  0.50  0.52      0.49  0.55  0.50   0.46  0.51   
       1     NaN    NaN   NaN   NaN       NaN   NaN   NaN    NaN   NaN   
       2     NaN    NaN   NaN   NaN       NaN   NaN   NaN    NaN   NaN   

          NESP55  GNAS-AS1  GNASXL  GNAS A/B  
NGS_34 0    0.55      0.52    0.49      0.62  
       1     NaN       NaN     NaN       NaN  
       2     NaN       NaN     NaN       NaN  
NGS_38 0    0.53      0.48    0.44      0.45  
       1     NaN       NaN     NaN       NaN  
       2     NaN       NaN     NaN       NaN  
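
Note that `str.split('_').str[-1]` also shortens `GNAS_A/B` to `A/B`. If only numeric prefixes such as `01_` should be removed, one possible variation is sketched below. The function name `spread_repeats`, its parameters and the pattern `^\d+_` are assumptions for illustration and not part of the code above. The sketch also assumes that the row labels are unique.

```python
import numpy as np
import pandas as pd


def spread_repeats(df, n_empty=2, prefix_pattern=r'^\d+_'):
    """Hypothetical helper: stack repeated columns into rows and pad each probe."""
    # Strip only a leading numeric prefix such as '01_' (assumed naming scheme)
    names = pd.Series(df.columns.str.replace(prefix_pattern, '', regex=True))
    # Occurrence number of each name: 0, 1, 2, ...
    counter = names.groupby(names).cumcount()

    out = df.copy()
    out.columns = pd.MultiIndex.from_arrays([names, counter])
    # Move the counter level from the columns into the rows
    out = out.stack()

    # Each probe gets one row per repeat plus n_empty padding rows
    n_rows = counter.max() + 1 + n_empty
    mux = pd.MultiIndex.from_product([df.index, np.arange(n_rows)])
    # Reindex adds the padding rows and restores the original column order
    return out.reindex(index=mux, columns=names.unique())


data = [['NGS_34', 0.47, 0.55, 0.51, 0.53, 0.54, 0.62],
        ['NGS_38', 0.52, 0.52, 0.49, 0.51, 0.52, 0.45]]
cols = ['probes', '01_PLAGL1', '02_PLAGL1', 'H19', 'H19', 'H19', 'GNAS_A/B']
example = pd.DataFrame(data, columns=cols).set_index('probes')

result = spread_repeats(example)
print(result)
```

With the example data from the question, this should give the same layout as the first output above, except that the last column keeps its full name `GNAS_A/B`.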

