I have a pandas dataframe that looks like:
  SampleID      expr             Gene  Period                     tag
4   HSB103  7.214731  ENSG00000198615       5  HSB103|ENSG00000198615
2   HSB103  4.214731  ENSG00000198725       4  HSB103|ENSG00000198725
5   HSB100  3.214731  ENSG00000198615       4  HSB100|ENSG00000198615
1   HSB106  2.200031  ENSG00000198780       5  HSB106|ENSG00000198780
0   HSB103  1.214731  ENSG00000198780       4  HSB103|ENSG00000198780
3   HSB103  0.214731  ENSG00000198615       4  HSB103|ENSG00000198615
What I want to do is group by Gene and then sort by descending expr, so that it looks like:
  SampleID      expr             Gene  Period                     tag
0   HSB103  7.214731  ENSG00000198615       5  HSB103|ENSG00000198615
1   HSB100  3.214731  ENSG00000198615       4  HSB100|ENSG00000198615
2   HSB103  0.214731  ENSG00000198615       4  HSB103|ENSG00000198615
3   HSB103  4.214731  ENSG00000198725       4  HSB103|ENSG00000198725
4   HSB106  2.200031  ENSG00000198780       5  HSB106|ENSG00000198780
5   HSB103  1.214731  ENSG00000198780       4  HSB103|ENSG00000198780
I've tried the following, but neither attempt works:
Attempt 1:
p4p5.sort_values(by=['expr'], ascending=[False], inplace=True).groupby(['Gene'])
Attempt 2:
p4p5.groupby(['Gene'])
p4p5.sort_values(by=['expr'], ascending=[False], inplace=True)
Update to question:
Once I group and sort, how can I then filter the dataframe to keep only the bottom 10% of expression per gene group? When I say bottom 10%, I mean in the theoretical distribution sense, NOT that if I had 100 rows per gene, I'd get 10 rows after filtering. I imagine it would be something like:
p4p5.sort_values(by=['Gene','expr'], ascending=[True,False], inplace=True).quantile([0.1])
You don't need groupby here; just use sort_values on both columns, such as:
p4p5.sort_values(by=['Gene','expr'], ascending=[True,False], inplace=True)
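As a minimal sketch of the two-column sort, assuming the six sample rows from the question are rebuilt by hand into a frame named p4p5 (the construction below is only for illustration). It also shows why Attempt 1 fails: sort_values with inplace=True returns None, so nothing can be chained onto it.

```python
import pandas as pd

# Sample rows from the question, rebuilt by hand for illustration
p4p5 = pd.DataFrame(
    {
        "SampleID": ["HSB103", "HSB106", "HSB103", "HSB103", "HSB103", "HSB100"],
        "expr": [1.214731, 2.200031, 4.214731, 0.214731, 7.214731, 3.214731],
        "Gene": [
            "ENSG00000198780",
            "ENSG00000198780",
            "ENSG00000198725",
            "ENSG00000198615",
            "ENSG00000198615",
            "ENSG00000198615",
        ],
        "Period": [4, 5, 4, 4, 5, 4],
    }
)
p4p5["tag"] = p4p5["SampleID"] + "|" + p4p5["Gene"]

# Sort by Gene ascending, then expr descending within each gene;
# without inplace=True the sorted frame is returned and can be chained
sorted_df = p4p5.sort_values(
    by=["Gene", "expr"], ascending=[True, False]
).reset_index(drop=True)
print(sorted_df)

# With inplace=True the call returns None, so chaining .groupby(...) raises
none_result = p4p5.copy().sort_values(by="expr", ascending=False, inplace=True)
print(none_result)
```

Either style works on its own; the point is to pick one: assign the returned frame, or use inplace=True as a standalone statement.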
EDIT: for the updated question, you can use groupby and tail, such as:
p4p5_bottom10 = (p4p5.sort_values(by='expr', ascending=False).groupby('Gene')
.apply(lambda df_g: df_g.tail(int(len(df_g)*0.1))))
You can add .reset_index(drop=True) at the end too.
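A hedged sketch of the same row-count idea without apply, using made-up data (genes "A" and "B" and their values are hypothetical). It keeps the last int(len * 0.1) rows of each gene after sorting, so a gene with fewer than 10 rows contributes nothing.

```python
import pandas as pd

# Hypothetical data: gene A has 20 rows, gene B has 10 rows
df = pd.DataFrame(
    {
        "Gene": ["A"] * 20 + ["B"] * 10,
        "expr": [float(v) for v in range(20)] + [float(v) for v in range(100, 110)],
    }
)

sorted_df = df.sort_values(["Gene", "expr"], ascending=[True, False])
grouped = sorted_df.groupby("Gene")

# Number of rows to keep per gene: 10% of the group size, truncated
n_keep = (grouped["expr"].transform("size") * 0.1).astype(int)

# Position counted from the end of each group (last row is 0)
pos_from_end = grouped.cumcount(ascending=False)

bottom_rows = sorted_df[pos_from_end < n_keep].reset_index(drop=True)
print(bottom_rows)
```

This selects by row count, not by quantile value, which is the interpretation the updated question says it does not want.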
2nd EDIT: I hope I understood correctly this time. You can do it like this:
# first sort
p4p5 = p4p5.sort_values(['Gene','expr'], ascending=[True,False]).reset_index(drop=True)
# select the rows below each gene's 10% quantile (reset_index not mandatory);
# transform keeps the original index, so the boolean mask aligns with p4p5
p4p5_bottom10 = (p4p5[p4p5['expr'] < p4p5.groupby('Gene')['expr'].transform(lambda x: x.quantile(0.1))]
                 .reset_index(drop=True))
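To illustrate the quantile filter, here is a small sketch on made-up data (genes "A" and "B" and their values are hypothetical). It broadcasts each gene's 10% quantile back to every row with transform and keeps the rows strictly below it; pandas uses linear interpolation for quantile by default.

```python
import pandas as pd

# Hypothetical data: 10 expression values for each of two genes
df = pd.DataFrame(
    {
        "Gene": ["A"] * 10 + ["B"] * 10,
        "expr": [float(v) for v in range(1, 11)]
        + [float(v) for v in range(10, 101, 10)],
    }
)

# Per-gene 10% quantile, repeated on every row of that gene
threshold = df.groupby("Gene")["expr"].transform(lambda s: s.quantile(0.1))

# Keep only the rows strictly below their own gene's threshold
bottom10 = df[df["expr"] < threshold].reset_index(drop=True)
print(bottom10)
```

With these values the thresholds are 1.9 for A and 19.0 for B, so one row per gene survives. Use <= instead of < if rows exactly at the quantile should be kept.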
A simple solution would be:
>>> df.sort_values(['Gene','expr'],ascending=[True,False]).groupby('Gene').tail(3)
SampleID expr Gene Period tag
0 HSB103 7.214731 ENSG00000198615 5 HSB103|ENSG00000198615
2 HSB100 3.214731 ENSG00000198615 4 HSB100|ENSG00000198615
5 HSB103 1.214731 ENSG00000198615 4 HSB103|ENSG00000198615
1 HSB103 4.214731 ENSG00000198725 4 HSB103|ENSG00000198725
3 HSB106 2.200031 ENSG00000198780 5 HSB106|ENSG00000198780
4 HSB103 1.214731 ENSG00000198780 4 HSB103|ENSG00000198780