How to groupby for one column and then sort_values for another column in a pandas dataframe?

I have a pandas dataframe that looks like:

  SampleID      expr             Gene  Period                     tag
4   HSB103  7.214731  ENSG00000198615       5  HSB103|ENSG00000198615
2   HSB103  4.214731  ENSG00000198725       4  HSB103|ENSG00000198725
5   HSB100  3.214731  ENSG00000198615       4  HSB100|ENSG00000198615
1   HSB106  2.200031  ENSG00000198780       5  HSB106|ENSG00000198780
0   HSB103  1.214731  ENSG00000198780       4  HSB103|ENSG00000198780
3   HSB103  0.214731  ENSG00000198615       4  HSB103|ENSG00000198615

What I want to do is group by Gene and then sort by descending expr, so that it looks like:

  SampleID      expr             Gene  Period                     tag
0   HSB103  7.214731  ENSG00000198615       5  HSB103|ENSG00000198615
1   HSB100  3.214731  ENSG00000198615       4  HSB100|ENSG00000198615
2   HSB103  0.214731  ENSG00000198615       4  HSB103|ENSG00000198615
3   HSB103  4.214731  ENSG00000198725       4  HSB103|ENSG00000198725
4   HSB106  2.200031  ENSG00000198780       5  HSB106|ENSG00000198780
5   HSB103  1.214731  ENSG00000198780       4  HSB103|ENSG00000198780

I've tried the following, but neither attempt works:

Attempt 1:

p4p5.sort_values(by=['expr'], ascending=[False], inplace=True).groupby(['Gene'])

Attempt 2:

p4p5.groupby(['Gene'])
p4p5.sort_values(by=['expr'], ascending=[False], inplace=True)

Update to question:

Once I group and sort, how can I then filter the dataframe to keep only the bottom 10% of expression per gene group? When I say bottom 10%, I mean it in the theoretical distribution sense, NOT in the sense that 100 rows per gene would leave 10 rows after filtering. I imagine it would be something like:

p4p5.sort_values(by=['Gene','expr'], ascending=[True,False], inplace=True).quantile([0.1])

You don't need groupby here; just call sort_values on both columns:

p4p5.sort_values(by=['Gene','expr'], ascending=[True,False], inplace=True)
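
For reference, here is a minimal self-contained sketch of that sort, assuming the frame is rebuilt from the sample rows in the question (the `tag` column is left out for brevity). Note that `sort_values(..., inplace=True)` returns `None`, which is why chaining `.groupby` onto it in Attempt 1 fails. Assigning the returned frame instead keeps the call chainable.

```python
import pandas as pd

# Sample rows copied from the question; `tag` is omitted for brevity.
p4p5 = pd.DataFrame(
    {
        "SampleID": ["HSB103", "HSB103", "HSB100", "HSB106", "HSB103", "HSB103"],
        "expr": [7.214731, 4.214731, 3.214731, 2.200031, 1.214731, 0.214731],
        "Gene": [
            "ENSG00000198615", "ENSG00000198725", "ENSG00000198615",
            "ENSG00000198780", "ENSG00000198780", "ENSG00000198615",
        ],
        "Period": [5, 4, 4, 5, 4, 4],
    },
    index=[4, 2, 5, 1, 0, 3],
)

# Sort by Gene ascending, then by expr descending within each gene.
# Without inplace=True the sorted frame is returned, so calls can be chained.
p4p5_sorted = (
    p4p5.sort_values(by=["Gene", "expr"], ascending=[True, False])
        .reset_index(drop=True)
)
print(p4p5_sorted)
```

The `reset_index(drop=True)` call is only there to renumber the rows 0 to 5, as in the desired output.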

EDIT: for the updated question, you can use groupby and tail:

p4p5_bottom10 = (p4p5.sort_values(by='expr', ascending=False).groupby('Gene')
                     .apply(lambda df_g: df_g.tail(int(len(df_g)*0.1))))

You can also add .reset_index(drop=True) at the end.
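
As a rough sketch of the same count-based idea without `apply`, the rows to keep can be selected by comparing each row's position from the end of its group with 10% of the group size. The data here is synthetic and hypothetical (genes "A" and "B" are made up), because the six-row sample in the question is too small for 10% of any group to round to a whole row.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): gene "A" has 20 rows, gene "B" has 10.
df = pd.DataFrame({
    "Gene": ["A"] * 20 + ["B"] * 10,
    "expr": np.concatenate([np.arange(20, dtype=float), np.arange(10, dtype=float)]),
})

df_sorted = df.sort_values(["Gene", "expr"], ascending=[True, False])
grouped = df_sorted.groupby("Gene")

# Rows to keep per gene: 10% of the group size, rounded down.
n_keep = grouped["expr"].transform("size") // 10
# Position counted from the end of each group (0 = last row = lowest expr).
pos_from_end = grouped.cumcount(ascending=False)

bottom10_by_count = df_sorted[pos_from_end < n_keep].reset_index(drop=True)
print(bottom10_by_count)
```

This keeps a fixed share of rows per gene, so it answers the row-count reading of "bottom 10%", not the quantile reading.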

2nd EDIT: I hope I have understood correctly this time. You can do it like this:

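# First sort by gene, then by descending expr within each gene
p4p5 = p4p5.sort_values(['Gene', 'expr'], ascending=[True, False]).reset_index(drop=True)
# Keep the rows below each gene's 10% quantile of expr (the final reset_index is optional).
# group_keys=False keeps the original row index on the mask so that it aligns with p4p5.
p4p5_bottom10 = (p4p5[p4p5.groupby('Gene', group_keys=False)['expr']
                          .apply(lambda x: x < x.quantile(0.1))]
                 .reset_index(drop=True))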
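
Another way to build the same mask is `transform`, which broadcasts each gene's 10% quantile back onto that gene's rows. The sketch below is an illustration rather than part of the original answer. It assumes only the `Gene` and `expr` columns from the question's sample, and it uses the empirical quantile with pandas' default linear interpolation.

```python
import pandas as pd

# Gene and expr columns copied from the question's sample rows.
p4p5 = pd.DataFrame({
    "Gene": [
        "ENSG00000198615", "ENSG00000198725", "ENSG00000198615",
        "ENSG00000198780", "ENSG00000198780", "ENSG00000198615",
    ],
    "expr": [7.214731, 4.214731, 3.214731, 2.200031, 1.214731, 0.214731],
})

p4p5 = p4p5.sort_values(["Gene", "expr"], ascending=[True, False]).reset_index(drop=True)

# Per-gene 10% quantile of expr, repeated on every row of that gene.
q10 = p4p5.groupby("Gene")["expr"].transform(lambda s: s.quantile(0.1))

# Keep only rows strictly below their own gene's threshold.
p4p5_bottom10 = p4p5[p4p5["expr"] < q10].reset_index(drop=True)
print(p4p5_bottom10)
```

Because the comparison is strict, a gene with a single row never passes the filter: its 10% quantile equals its only value.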

A simple solution is to sort and then take the last rows of each group with tail:

>>> df.sort_values(['Gene','expr'],ascending=[True,False]).groupby('Gene').tail(3)
  SampleID      expr             Gene  Period                     tag
0   HSB103  7.214731  ENSG00000198615       5  HSB103|ENSG00000198615
2   HSB100  3.214731  ENSG00000198615       4  HSB100|ENSG00000198615
5   HSB103  1.214731  ENSG00000198615       4  HSB103|ENSG00000198615
1   HSB103  4.214731  ENSG00000198725       4  HSB103|ENSG00000198725
3   HSB106  2.200031  ENSG00000198780       5  HSB106|ENSG00000198780
4   HSB103  1.214731  ENSG00000198780       4  HSB103|ENSG00000198780
