Partial Multiindexing with a Pandas DataFrame

I have a dataframe as follows:

import pandas as pd

df = pd.DataFrame(columns=['New Category', 'Sample1', 'Sample2'],
                  data=[
                      ['Pathogenic/Likely Pathogenic', '0/0:240', '1/0:100'],
                      ['Likely Benign', '1/1:237', '1/0:700'],
                      ['Likely Benign', '0/0:239', '0/0:234'],
                      ['Likely Benign', '1/1:238', '0/1:890'],
                      ['Likely Benign', '0/1:156', '1/1:767'],
                      ['VUS', '1/1:241', '0/1:21']
                  ])

Which looks like this:

               New Category       Sample1   Sample2
0  Pathogenic/Likely Pathogenic   0/0:240   1/0:100
1                 Likely Benign   1/1:237   1/0:700
2                 Likely Benign   0/0:239   0/0:234
3                 Likely Benign   1/1:238   0/1:890
4                 Likely Benign   0/1:156   1/1:767
5                           VUS   1/1:241   0/1:21

I want to build a column MultiIndex so that each Sample1 and Sample2 value is split at the colon and the two parts are placed under sub-column names (GT and GQ). However, I do not want these sub-column names to apply to the New Category column. Basically I want it to look like this:

               New Category       Sample1   Sample2
                                  GT   GQ    GT   GQ
0  Pathogenic/Likely Pathogenic   0/0  240   1/0  100
1                 Likely Benign   1/1  237   1/0  700
2                 Likely Benign   0/0  239   0/0  234
3                 Likely Benign   1/1  238   0/1  890
4                 Likely Benign   0/1  156   1/1  767
5                           VUS   1/1  241   0/1  21

I am really stumped on how to do this. The multiindexing page of the pandas docs contains no example of multiindexing on selected columns only. This is making me wonder whether it is even possible.

This is not really a matter of "indexing", but rather of manipulating data, in particular splitting the columns. The following should do:

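# Rebuild the category column under a two-level header with an empty second level
df_new_category = pd.DataFrame(
    df[['New Category']].values,
    columns=pd.MultiIndex.from_tuples([('New Category', '')])
)
# Split each sample column on ':' into a (sample, 'GT') and a (sample, 'GQ') column
sample_data_dfs = [
    pd.DataFrame(list(df[col].str.split(':')),
                 columns=pd.MultiIndex.from_product([[col], ['GT', 'GQ']]))
    for col in ['Sample1', 'Sample2']
]

# Place the pieces side by side
pd.concat([df_new_category] + sample_data_dfs, axis=1)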
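
For comparison, here is a minimal alternative sketch of the same idea. It is not the code above: it assumes every sample value contains exactly one colon, uses `str.split(..., expand=True)` to get the two parts as columns directly, and lets `pd.concat(..., keys=...)` build the outer column level. The helper name `split_sample` and the shortened three-row frame are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    columns=['New Category', 'Sample1', 'Sample2'],
    data=[
        ['Pathogenic/Likely Pathogenic', '0/0:240', '1/0:100'],
        ['Likely Benign', '1/1:237', '1/0:700'],
        ['VUS', '1/1:241', '0/1:21'],
    ],
)
sample_cols = ['Sample1', 'Sample2']


def split_sample(series):
    """Hypothetical helper: split 'GT:GQ' strings into two named columns."""
    parts = series.str.split(':', expand=True)  # DataFrame with columns 0 and 1
    parts.columns = ['GT', 'GQ']
    return parts


# keys= becomes the outer column level: ('Sample1', 'GT'), ('Sample1', 'GQ'), ...
samples = pd.concat([split_sample(df[col]) for col in sample_cols],
                    axis=1, keys=sample_cols)

# Give the category column a matching two-level header with an empty sub-name
category = df[['New Category']].copy()
category.columns = pd.MultiIndex.from_tuples([('New Category', '')])

result = pd.concat([category, samples], axis=1)
print(result)
```

With this layout, `result['Sample1']` returns a two-column frame (GT, GQ), while `result[('New Category', '')]` returns the category column. Note that the split values stay strings; converting GQ to a number would be a separate step.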

Notice that you could do the splitting all at once (i.e. without a loop over the columns), as follows:

df[['Sample1', 'Sample2']].applymap(lambda s : s.split(':'))

... but

  • this is considerably slower, because you are implicitly looping over every cell
  • you would still need another loop to extract the individual newly created columns
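
To make the second point concrete, here is a small sketch of what the cell-wise split leaves behind. It assumes a two-row toy frame, and it picks `DataFrame.map` when available because `applymap` was renamed to `map` in pandas 2.1; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Sample1': ['0/0:240', '1/1:237'],
    'Sample2': ['1/0:100', '1/0:700'],
})

# DataFrame.map is the newer name for applymap; fall back on older pandas
elementwise = df.map if hasattr(pd.DataFrame, 'map') else df.applymap
cells = elementwise(lambda s: s.split(':'))

# Each cell is now a Python list such as ['0/0', '240'], not two columns
print(cells.loc[0, 'Sample1'])

# A second pass over the columns is still needed to expand those lists
expanded = pd.concat(
    [pd.DataFrame(cells[col].tolist(), columns=['GT', 'GQ'], index=cells.index)
     for col in cells.columns],
    axis=1,
    keys=list(cells.columns),
)
print(expanded)
```

So the cell-wise call does not remove the per-column loop; it only moves the splitting in front of it.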
