Pivot Pandas Dataframe with Duplicates using Masking

Question

A non-indexed df contains rows of gene, a cell that contains a mutation in that gene, and the type of mutation in that gene:

import pandas as pd

df = pd.DataFrame({'gene': ['one','one','one','two','two','two','three'],
                   'cell': ['A', 'A', 'C', 'A', 'B', 'C','A'],
                   'mutation': ['frameshift', 'missense', 'nonsense', '3UTR', '3UTR', '3UTR', '3UTR']})

df:

  cell   gene    mutation
0    A    one  frameshift
1    A    one    missense
2    C    one    nonsense
3    A    two        3UTR
4    B    two        3UTR
5    C    two        3UTR
6    A  three        3UTR

I'd like to pivot this df so that it is indexed by gene, with one column per cell. The trouble is that there can be multiple entries per gene/cell pair, since a gene can carry several mutations in a given cell (cell A has two different mutations in gene one). So when I run:

df.pivot_table(index='gene', columns='cell', values='mutation')

this happens:

DataError: No numeric types to aggregate

I'd like to use masking to perform the pivot while capturing the presence of at least one mutation:

       A  B  C
gene          
one    1  0  1
two    1  1  1
three  1  0  0

Answer 1

pivot_table can handle duplicate index/column combinations; it is the pivot method that raises an error on duplicates. The error from the code above is about the data type of the column, not about duplicate entries: the default aggregation (mean) only works on numeric columns, and mutation holds strings. You can fix the problem by changing the aggregation to something that works on strings, such as count:

df.pivot_table(index='gene',
               columns='cell',
               values='mutation',
               aggfunc='count', fill_value=0)
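
Note that count returns the number of mutations per gene/cell pair, so gene one in cell A comes out as 2 rather than 1. If only presence or absence matters, the counts can be masked to 0/1 afterwards. A minimal sketch, assuming the df from the question (the names counts and presence are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
                   'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
                   'mutation': ['frameshift', 'missense', 'nonsense',
                                '3UTR', '3UTR', '3UTR', '3UTR']})

# Count mutations per (gene, cell); pairs that never occur become 0.
counts = df.pivot_table(index='gene', columns='cell', values='mutation',
                        aggfunc='count', fill_value=0)

# Boolean mask turned into integers: 1 where at least one mutation
# was recorded, 0 otherwise.
presence = (counts > 0).astype(int)
print(presence)
```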

If you only want a 1 for each gene/cell pair that has at least one mutation, you can group by gene and cell, aggregate every group to 1, and then unstack the cell level:

df.groupby(['gene', 'cell'])['mutation'].agg(lambda x: 1).unstack(fill_value=0)
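
As an alternative that is not part of the answer above, pd.crosstab can build the gene-by-cell count table directly, and clipping the counts at 1 gives the presence flags. This is only a sketch, assuming the df from the question; presence is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
                   'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
                   'mutation': ['frameshift', 'missense', 'nonsense',
                                '3UTR', '3UTR', '3UTR', '3UTR']})

# crosstab counts how often each (gene, cell) pair occurs;
# clip(upper=1) caps every count at 1, leaving zeros untouched.
presence = pd.crosstab(df['gene'], df['cell']).clip(upper=1)
print(presence)
```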

Answer 2

Solution with drop_duplicates and pivot_table:

# Keep one row per (cell, gene) pair, then count rows per pair.
result = (df.drop_duplicates(['cell', 'gene'])
            .pivot_table(index='gene',
                         columns='cell',
                         values='mutation',
                         aggfunc=len,
                         fill_value=0))
print(result)
cell   A  B  C
gene          
one    1  0  1
three  1  0  0
two    1  1  1

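Another solution uses drop_duplicates, then groupby with size as the aggregation, and finally reshapes with unstack:

# size() counts rows per (cell, gene); unstack(0) moves cell into the columns.
result = (df.drop_duplicates(['cell', 'gene'])
            .groupby(['cell', 'gene'])
            .size()
            .unstack(0, fill_value=0))
print(result)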
cell   A  B  C
gene          
one    1  0  1
three  1  0  0
two    1  1  1
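
The rows in these outputs come out in alphabetical order (one, three, two). If the order in which the genes first appear in df is preferred, one option is to reindex the result afterwards. A sketch, assuming the df from the question; ordered is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
                   'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
                   'mutation': ['frameshift', 'missense', 'nonsense',
                                '3UTR', '3UTR', '3UTR', '3UTR']})

result = (df.drop_duplicates(['cell', 'gene'])
            .groupby(['cell', 'gene'])
            .size()
            .unstack(0, fill_value=0))

# unique() keeps first-appearance order, so reindexing with it
# restores the row order of the original data.
ordered = result.reindex(df['gene'].unique())
print(ordered)
```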

2 answers

solution1
1 2016-12-16 05:53:28

solution2
1 ACCEPTED 2016-12-16 06:12:45
