
Pivot Pandas Dataframe with Duplicates using Masking

A non-indexed df has one row per mutation: the gene, a cell that carries a mutation in that gene, and the type of mutation:

import pandas as pd

df = pd.DataFrame({'gene': ['one','one','one','two','two','two','three'],
                   'cell': ['A', 'A', 'C', 'A', 'B', 'C','A'],
                   'mutation': ['frameshift', 'missense', 'nonsense', '3UTR', '3UTR', '3UTR', '3UTR']})

df:

  cell   gene    mutation
0    A    one  frameshift
1    A    one    missense
2    C    one    nonsense
3    A    two        3UTR
4    B    two        3UTR
5    C    two        3UTR
6    A  three        3UTR

I'd like to pivot this df so that it is indexed by gene, with one column per cell. The trouble is that a gene can have more than one mutation in a given cell (cell A has two different mutations in gene one), so the same gene/cell pair can appear in several rows. So when I run:

df.pivot_table(index='gene', columns='cell', values='mutation')

this happens:

DataError: No numeric types to aggregate

I'd like to use masking to perform the pivot while capturing the presence of at least one mutation:

       A  B  C
gene          
one    1  0  1
two    1  1  1
three  1  0  0

The duplicate gene/cell pairs are not what makes pivot_table fail here. pivot_table accepts repeated index/column combinations and aggregates them; the pivot method does not, and raises an error about duplicate entries instead. The error above comes from the default aggregation, mean, which only works on numeric columns, while mutation holds strings. So it is an error about the data type of the column, not about the index. You can fix it by changing the aggregation to something that works on strings, such as count:

df.pivot_table(index='gene',
               columns='cell',
               values='mutation',
               aggfunc='count', fill_value=0)
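To make the pivot versus pivot_table distinction concrete, here is a minimal sketch on the question's data. The variable names pivot_error and first_seen are only illustrative, and 'first' is used here simply as another aggregation that works on strings; it is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

# pivot() cannot reshape when a (gene, cell) pair occurs more than once.
try:
    df.pivot(index='gene', columns='cell', values='mutation')
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)
print(pivot_error)

# pivot_table() accepts the repeated pair and aggregates it.
# 'first' keeps the first mutation seen for each (gene, cell) pair.
first_seen = df.pivot_table(index='gene', columns='cell',
                            values='mutation', aggfunc='first')
print(first_seen)
```

Pairs that never occur (for example gene one in cell B) come back as missing values unless fill_value is given.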

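The count above gives 2 for gene one in cell A. If you only want a 1 wherever a gene/cell pair occurs, you can group by both keys, aggregate every group to 1, and then unstack the cell level into columns.

df.groupby(['gene', 'cell'])['mutation'].agg(lambda x: 1).unstack(fill_value=0)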
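Since the question asks about masking, another way to get 0/1 flags is to take the counts from pivot_table and compare them against zero. This is only a sketch of that idea, and counts, mask and presence are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

# Number of mutations recorded for each (gene, cell) pair; 0 where none.
counts = df.pivot_table(index='gene', columns='cell', values='mutation',
                        aggfunc='count', fill_value=0)

# Boolean mask: True wherever at least one mutation was recorded.
mask = counts > 0

# Cast the mask to integers to get the 0/1 table.
presence = mask.astype(int)
print(presence)
```

Keeping counts around is handy if the number of mutations per pair is needed later; presence only records whether there was any.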

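Solution with drop_duplicates and pivot_table:

# Parentheses let the method chain span several lines.
# The result goes into res so the original df is kept.
res = (df.drop_duplicates(['cell','gene'])
         .pivot_table(index='gene', 
                      columns='cell', 
                      values='mutation',
                      aggfunc=len, 
                      fill_value=0))
print (res)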
cell   A  B  C
gene          
one    1  0  1
three  1  0  0
two    1  1  1

Another solution uses drop_duplicates, then groupby with size, and finally reshapes with unstack:

# unstack(0) moves the first index level (cell) into the columns.
res = (df.drop_duplicates(['cell','gene'])
         .groupby(['cell', 'gene'])
         .size()
         .unstack(0, fill_value=0))
print (res)
cell   A  B  C
gene          
one    1  0  1
three  1  0  0
two    1  1  1
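For comparison, a shorter route that none of the answers above use is pd.crosstab, which counts gene/cell co-occurrences directly. The sketch below caps the counts at 1 to get presence flags; tab and presence are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

# Frequency table: rows are genes, columns are cells, 0 where a pair is absent.
tab = pd.crosstab(df['gene'], df['cell'])

# Cap every count at 1 so the table only records presence.
presence = tab.clip(upper=1)
print(presence)
```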
