[英]Formating a data frame reducing column and increasing rows
I have a pandas data frame like this我有一个这样的 pandas 数据框
data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
df
01_PLAGL1 02_PLAGL1 H19 H19 H19 GNAS_A/B
probes
NGS_34 0.47 0.55 0.51 0.53 0.54 0.62
NGS_38 0.52 0.52 0.49 0.51 0.52 0.45
This is actually a minimal reproducible example.这实际上是一个最小的可重现示例。 The real data frame is formed by many paired columns like the example 01_PLAGL1
02_PLAGL1
, then 2 sets of three columns like the example H19
H19
H19
and 2 unique columns.真实的数据框由许多成对的列组成,例如示例01_PLAGL1
02_PLAGL1
,然后是 2 组三列,例如示例H19
H19
H19
和 2 个唯一列。 With this explanation and the columns of my real dataset below, I think you will understand the input data of my problem.通过这个解释和下面我的真实数据集的列,我想你会理解我的问题的输入数据。
data_no_control.columns.values
array(['PLAGL1', 'PLAGL1', 'GRB10', 'GRB10', 'MEST', 'MEST', 'H19', 'H19',
'H19', 'KCNQ1OT1', 'KCNQ1OT1', 'MEG3', 'MEG3', 'MEG8', 'MEG8',
'SNRPN', 'SNRPN', 'PEG3', 'PEG3', 'PEG3', 'NESP55', 'GNAS-AS1',
'GNASXL', 'GNASXL', 'GNAS_A/B'], dtype=object)
The final output I would like to achieve should be like this最后我想实现的output应该是这样的
01_PLAGL1 H19 GNAS A/B
probes
NGS_34 0.47 0.51 0.62
0.55 0.53
0.54
(One empty row)
(Second empty row)
NGS_38 0.52 0.49 0.45
0.52 0.51
0.52
(One empty row)
(Second empty row)
NGS_41 ...
I have tried this我试过这个
df = data_no_control.reset_index(level=0)
empty_rows = 5
df.index = range(0, empty_rows*len(df), empty_rows)
new_df = df.reindex(index=range(empty_rows*len(df)))
new_df = new_df.set_index('index')
new_df
index 01_PLAGL1 02_PLAGL1 H19 H19 H19 GNAS_A/B
NGS_34 0.47 0.55 0.51 0.53 0.54 0.62
NaN NaN NaN NaN NaN. NaN NaN
NaN NaN NaN NaN NaN. NaN NaN
NaN NaN NaN NaN NaN. NaN NaN
NaN NaN NaN NaN NaN. NaN NaN
NGS_38 0.52 0.52 0.49 0.51 0.52 0.45
NaN NaN NaN NaN NaN. NaN NaN
NaN NaN NaN NaN NaN. NaN NaN
NaN NaN NaN NaN NaN. NaN NaN
NaN NaN NaN NaN NaN. NaN NaN
Use:利用:
data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
#No of new rows
new = 2
#reove values before _ for pairs columns names
s = df.columns.str.split('_').str[-1].to_series()
#create Multiindex by counter
df.columns = [s, s.groupby(s).cumcount()]
#reshape
df = df.stack()
#create MultiIndex for add new rows and original order in columns names
mux = pd.MultiIndex.from_product([df.index.levels[0],
np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())
print (df)
PLAGL1 H19 A/B
probes
NGS_34 0 0.47 0.51 0.62
1 0.55 0.53 NaN
2 NaN 0.54 NaN
3 NaN NaN NaN
4 NaN NaN NaN
NGS_38 0 0.52 0.49 0.45
1 0.52 0.51 NaN
2 NaN 0.52 NaN
3 NaN NaN NaN
4 NaN NaN NaN
Last if need empty values instead misisng values and no counter level use:最后如果需要空值而不是 misisng 值并且没有计数器级别使用:
df = df.droplevel(1).fillna('')
df.index = df.index.where(~df.index.duplicated(), '')
print (df)
PLAGL1 H19 A/B
probes
NGS_34 0.47 0.51 0.62
0.55 0.53
0.54
NGS_38 0.52 0.49 0.45
0.52 0.51
0.52
EDIT: In real data are not duplicates, so ouput is different:编辑:在实际数据中不重复,所以输出不同:
d = {'PLAGL1': {'NGS_34': 0.55, 'NGS_38': 0.52}, 'GRB10': {'NGS_34': 0.48, 'NGS_38': 0.49}, 'MEST': {'NGS_34': 0.56, 'NGS_38': 0.5}, 'H19': {'NGS_34': 0.54, 'NGS_38': 0.52}, 'KCNQ1OT1': {'NGS_34': 0.41, 'NGS_38': 0.49}, 'MEG3': {'NGS_34': 0.5, 'NGS_38': 0.55}, 'MEG8': {'NGS_34': 0.46, 'NGS_38': 0.5}, 'SNRPN': {'NGS_34': 0.55, 'NGS_38': 0.46}, 'PEG3': {'NGS_34': 0.51, 'NGS_38': 0.51}, 'NESP55': {'NGS_34': 0.55, 'NGS_38': 0.53}, 'GNAS-AS1': {'NGS_34': 0.52, 'NGS_38': 0.48}, 'GNASXL': {'NGS_34': 0.49, 'NGS_38': 0.44}, 'GNAS A/B': {'NGS_34': 0.62, 'NGS_38': 0.45}}
df = pd.DataFrame(d)
print (df)
PLAGL1 GRB10 MEST H19 KCNQ1OT1 MEG3 MEG8 SNRPN PEG3 NESP55 \
NGS_34 0.55 0.48 0.56 0.54 0.41 0.50 0.46 0.55 0.51 0.55
NGS_38 0.52 0.49 0.50 0.52 0.49 0.55 0.50 0.46 0.51 0.53
GNAS-AS1 GNASXL GNAS A/B
NGS_34 0.52 0.49 0.62
NGS_38 0.48 0.44 0.45
#No of new rows
new = 2
#reove values before _ for pairs columns names
s = df.columns.str.split('_').str[-1].to_series()
#create Multiindex by counter
df.columns = [s, s.groupby(s).cumcount()]
#reshape
df = df.stack()
#create MultiIndex for add new rows and original order in columns names
mux = pd.MultiIndex.from_product([df.index.levels[0],
np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())
print (df)
PLAGL1 GRB10 MEST H19 KCNQ1OT1 MEG3 MEG8 SNRPN PEG3 \
NGS_34 0 0.55 0.48 0.56 0.54 0.41 0.50 0.46 0.55 0.51
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
NGS_38 0 0.52 0.49 0.50 0.52 0.49 0.55 0.50 0.46 0.51
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
NESP55 GNAS-AS1 GNASXL GNAS A/B
NGS_34 0 0.55 0.52 0.49 0.62
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
NGS_38 0 0.53 0.48 0.44 0.45
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.