A non-indexed df contains rows of gene, a cell that contains a mutation in that gene, and the type of mutation in that gene:
import pandas as pd

df = pd.DataFrame({'gene': ['one','one','one','two','two','two','three'],
                   'cell': ['A', 'A', 'C', 'A', 'B', 'C','A'],
                   'mutation': ['frameshift', 'missense', 'nonsense', '3UTR', '3UTR', '3UTR', '3UTR']})
df:
cell gene mutation
0 A one frameshift
1 A one missense
2 C one nonsense
3 A two 3UTR
4 B two 3UTR
5 C two 3UTR
6 A three 3UTR
I'd like to pivot this df so I can index by gene and set columns to cells. The trouble is that there can be multiple entries per gene/cell pair: a gene can carry several mutations in a given cell (cell A has two different mutations in gene one). So when I run:
df.pivot_table(index='gene', columns='cell', values='mutation')
this happens:
DataError: No numeric types to aggregate
I'd like to use masking to perform the pivot while capturing the presence of at least one mutation:
A B C
gene
one 1 0 1
two 1 1 1
three 1 0 0
The error message is not what is produced when you run pivot_table. You can have multiple values in the index for pivot_table; I don't believe this is true for the pivot method. You can, however, fix your problem by changing the aggregation to something that works on strings rather than numbers. Most aggregation functions (including the default, mean) operate on numeric columns, so the code you wrote above produces an error relating to the data type of the column, not an index error.
df.pivot_table(index='gene',
columns='cell',
values='mutation',
aggfunc='count', fill_value=0)
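Note that count returns the number of mutation rows per gene/cell pair, so gene one in cell A comes out as 2 rather than 1. A minimal sketch of turning those counts into the presence/absence mask the question asks for, assuming the sample frame from the question (the names counts and presence are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

# Count mutation rows per (gene, cell); pairs with no rows become 0.
counts = df.pivot_table(index='gene', columns='cell', values='mutation',
                        aggfunc='count', fill_value=0)

# Boolean mask "at least one mutation", cast to 0/1 integers.
presence = (counts > 0).astype(int)
print(presence)
```

The mask step is independent of how the counts were produced, so it should work equally well on the output of any other counting approach.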
If you only want 1 value per cell, you can do a groupby, aggregate every group to 1, and then unstack the cell level so that genes stay in the index.
df.groupby(['gene', 'cell'])['mutation'].agg(lambda x: 1).unstack(fill_value=0)
Solution with drop_duplicates and pivot_table:
# Parentheses allow the method chain to span several lines.
df = (df.drop_duplicates(['cell','gene'])
        .pivot_table(index='gene',
                     columns='cell',
                     values='mutation',
                     aggfunc=len,
                     fill_value=0))
print (df)
cell A B C
gene
one 1 0 1
three 1 0 0
two 1 1 1
Another solution with drop_duplicates, groupby with aggregate size, and a final reshape by unstack:
# Parentheses allow the method chain to span several lines.
df = (df.drop_duplicates(['cell','gene'])
        .groupby(['cell', 'gene'])
        .size()
        .unstack(0, fill_value=0))
print (df)
cell A B C
gene
one 1 0 1
three 1 0 0
two 1 1 1
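For comparison, a different technique from the ones above: pd.crosstab tabulates how often each gene/cell pair occurs, and clipping the counts at 1 leaves a presence/absence table. This is only a sketch on the question's sample data; the reindex call is an optional extra that restores the gene order used in the question, and the name presence is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

# Frequency table of gene x cell; absent pairs are 0.
# clip(upper=1) caps every count at 1, giving a 0/1 indicator.
presence = pd.crosstab(df['gene'], df['cell']).clip(upper=1)

# Optional: put genes back in the order they appear in the question.
presence = presence.reindex(['one', 'two', 'three'])
print(presence)
```

Unlike the drop_duplicates variants, this does not overwrite df, so the original long-format data stays available for further queries.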