简体   繁体   中英

How to groupby for one column and then sort_values for another column in a pandas dataframe?

I have a pandas dataframe that looks like:

  SampleID      expr             Gene  Period                     tag
4   HSB103  7.214731  ENSG00000198615       5  HSB103|ENSG00000198615
2   HSB103  4.214731  ENSG00000198725       4  HSB103|ENSG00000198725
5   HSB100  3.214731  ENSG00000198615       4  HSB100|ENSG00000198615
1   HSB106  2.200031  ENSG00000198780       5  HSB106|ENSG00000198780
0   HSB103  1.214731  ENSG00000198780       4  HSB103|ENSG00000198780
3   HSB103  0.214731  ENSG00000198615       4  HSB103|ENSG00000198615

What I want to do is group by the Gene and then sort by descending expr , so that it looks like:

  SampleID      expr             Gene  Period                     tag
0   HSB103  7.214731  ENSG00000198615       5  HSB103|ENSG00000198615
1   HSB100  3.214731  ENSG00000198615       4  HSB100|ENSG00000198615
2   HSB103  0.214731  ENSG00000198615       4  HSB103|ENSG00000198615
3   HSB103  4.214731  ENSG00000198725       4  HSB103|ENSG00000198725
4   HSB106  2.200031  ENSG00000198780       5  HSB106|ENSG00000198780
5   HSB103  1.214731  ENSG00000198780       4  HSB103|ENSG00000198780

I've tried the following, but none of them work:

Attempt 1:

p4p5.sort_values(by=['expr'], ascending=[False], inplace=True).groupby(['Gene'])

Attempt 2:

p4p5.groupby(['Gene'])
p4p5.sort_values(by=['expr'], ascending=[False], inplace=True)

Update to question :

Once I group and sort, how can I then filter the dataframe to keep only the bottom 10% of expression per gene group? When I say bottom 10% , I mean in the theoretical distribution sense, NOT if I had 100 rows per gene, I'd get 10 rows after filtering. I imagine it would it be something like:

p4p5.sort_values(by=['Gene','expr'], ascending=[True,False], inplace=True).quantile([0.1])

you don't need groupby here, just sort_values by both columns such as:

p4p5.sort_values(by=['Gene','expr'], ascending=[True,False], inplace=True)

EDIT: for updated question, you can use groupby and tail such as:

p4p5_bottom10 = (p4p5.sort_values(by='expr', ascending=False).groupby('Gene')
                     .apply(lambda df_g: df_g.tail(int(len(df_g)*0.1))))

you can add .reset_index(drop=True) at the end too

2nd EDIT: hope this time I understood well, you can do it like this:

#first sort 
p4p5= p4p5.sort_values(['Gene','expr'], ascending=[True,False]).reset_index(drop=True)
# select the part of the data under quantile 10% (reset_index not mandatory)
p4p5_bottom10  = (p4p5[p4p5.groupby('Gene')['expr'].apply(lambda x: x < x.quantile(0.1))]
                       .reset_index(drop=True))

Simple solution will be:

>>> df.sort_values(['Gene','expr'],ascending=[True,False]).groupby('Gene').tail(3)
  SampleID      expr             Gene  Period                     tag
0   HSB103  7.214731  ENSG00000198615       5  HSB103|ENSG00000198615
2   HSB100  3.214731  ENSG00000198615       4  HSB100|ENSG00000198615
5   HSB103  1.214731  ENSG00000198615       4  HSB103|ENSG00000198615
1   HSB103  4.214731  ENSG00000198725       4  HSB103|ENSG00000198725
3   HSB106  2.200031  ENSG00000198780       5  HSB106|ENSG00000198780
4   HSB103  1.214731  ENSG00000198780       4  HSB103|ENSG00000198780

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM