
Pivot Pandas Dataframe with Duplicates using Masking

A non-indexed df contains one row per mutation: the gene, a cell that carries a mutation in that gene, and the type of that mutation:

import pandas as pd

df = pd.DataFrame({'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
                   'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
                   'mutation': ['frameshift', 'missense', 'nonsense', '3UTR', '3UTR', '3UTR', '3UTR']})

df:

  cell   gene    mutation
0    A    one  frameshift
1    A    one    missense
2    C    one    nonsense
3    A    two        3UTR
4    B    two        3UTR
5    C    two        3UTR
6    A  three        3UTR

I'd like to pivot this df so I can index by gene and set the columns to cells. The trouble is that there can be multiple entries per gene/cell pair: any one gene can have multiple mutations in a given cell (cell A has two different mutations in gene one). So when I run:

df.pivot_table(index='gene', columns='cell', values='mutation')

this happens:

DataError: No numeric types to aggregate

I'd like to use masking to perform the pivot while capturing the presence of at least one mutation (1 if the gene has any mutation in that cell, 0 otherwise):

       A  B  C
gene          
one    1  0  1
two    1  1  1
three  1  0  0

The duplicate entries are not what makes pivot_table fail here. pivot_table accepts multiple rows for the same index/column combination; I don't believe this is true for the pivot method. The code you wrote above produces an error relating to the data type of the column, not an index error: most aggregation functions operate on numeric columns, and the default one (the mean) cannot be applied to strings. You can fix your problem by changing the aggregation to something that works on strings as opposed to numerics.
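To see the two failure modes side by side, here is a minimal sketch built on the question's data. The exact exception class raised by pivot_table differs between pandas versions, so the sketch catches a generic Exception for that call rather than assuming one.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

# pivot cannot reshape when an index/column pair occurs more than once
# (gene 'one' in cell 'A' appears twice), so it raises ValueError.
try:
    df.pivot(index='gene', columns='cell', values='mutation')
    pivot_error = None
except ValueError as exc:
    pivot_error = exc

# pivot_table tolerates the duplicates, but its default aggfunc is the mean,
# which cannot be computed on a string column.
try:
    df.pivot_table(index='gene', columns='cell', values='mutation')
    pivot_table_error = None
except Exception as exc:  # exception class varies across pandas versions
    pivot_table_error = exc

print(type(pivot_error).__name__, '|', type(pivot_table_error).__name__)
```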

df.pivot_table(index='gene',
               columns='cell',
               values='mutation',
               aggfunc='count', fill_value=0)
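Note that 'count' returns the number of mutations per gene/cell pair, so gene one in cell A comes out as 2 rather than 1. As a sketch of the masking step the question asks for, the counts can be compared against zero and cast back to integers; the names counts and presence are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

# Number of mutation rows for every gene/cell pair (0 where there are none).
counts = df.pivot_table(index='gene', columns='cell',
                        values='mutation', aggfunc='count', fill_value=0)

# Mask: True where at least one mutation was counted, then cast to 0/1.
presence = counts.gt(0).astype(int)
print(presence)
```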

If you only want a 1 for each gene/cell pair that has at least one mutation, you can group by both keys, aggregate every group to 1, and then unstack a level so that the cells become columns:

df.groupby(['gene', 'cell'])['mutation'].agg(lambda x: 1).unstack(fill_value=0)
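For comparison, a similar presence/absence table can be sketched with pd.crosstab instead of groupby. This is a swapped-in alternative, not the approach described above: crosstab counts the rows per gene/cell pair, and clip(upper=1) caps each count at 1.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

# Rows are genes, columns are cells, values are row counts capped at 1.
presence = pd.crosstab(df['gene'], df['cell']).clip(upper=1)
print(presence)
```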

Solution with drop_duplicates and pivot_table:

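# Keep one row per gene/cell pair, then count rows per pair (each count is 1).
# The parentheses allow the method chain to span several lines.
result = (df.drop_duplicates(['cell', 'gene'])
            .pivot_table(index='gene',
                         columns='cell',
                         values='mutation',
                         aggfunc=len,
                         fill_value=0))
print(result)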
cell   A  B  C
gene          
one    1  0  1
three  1  0  0
two    1  1  1

Another solution with drop_duplicates, groupby with the size aggregation, and a final reshape by unstack:

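# Keep one row per gene/cell pair, count rows per pair, then move
# the first group level (cell) into the columns.
result = (df.drop_duplicates(['cell', 'gene'])
            .groupby(['cell', 'gene'])
            .size()
            .unstack(0, fill_value=0))
print(result)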
cell   A  B  C
gene          
one    1  0  1
three  1  0  0
two    1  1  1
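Both outputs list the genes in sorted order (one, three, two). If the order of first appearance in the original frame is preferred, a reindex on the unique gene values is one way to get it; the sketch below assumes the original, un-pivoted df and uses the illustrative names presence and ordered.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

presence = (df.drop_duplicates(['cell', 'gene'])
              .groupby(['cell', 'gene'])
              .size()
              .unstack(0, fill_value=0))

# Reorder the rows by the first appearance of each gene in df.
ordered = presence.reindex(df['gene'].unique())
print(ordered)
```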
