
How to perform a selective groupby().count() in pandas?

I'm struggling with the implementation of a peculiar combination of pandas groupby().count() and column average computation in a script, and since I'm operating on a tight schedule I decided to ask for help here on Stack, hoping someone knows a very pythonic and pandas-oriented solution that I can add to my toolkit. All the ideas I came up with are a bit sloppy and I don't like them.

I have a pandas dataframe with 500+ rows and 80+ columns that looks like this:

ID  S1_R1   S1_R2   S2_R1   S2_R2  ... 
A   10      10      5       10     ...
A   9       10      0       0      ...
A   0       0       10      9      ...
B   6       10      0       0      ...
B   0       0       15      11     ...
C   5       12      0       0      ...
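
For reference, a minimal toy frame mirroring the example above could be built like this (the real data has 500+ rows and 80+ columns; `df` is just the name used throughout this page):

import pandas as pd

# Toy version of the frame shown above
df = pd.DataFrame({
    'ID':    ['A', 'A', 'A', 'B', 'B', 'C'],
    'S1_R1': [10, 9, 0, 6, 0, 5],
    'S1_R2': [10, 10, 0, 10, 0, 12],
    'S2_R1': [5, 0, 10, 0, 15, 0],
    'S2_R2': [10, 0, 9, 0, 11, 0],
})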

I would like to first obtain the averages of the S* columns. I put "S" but in reality I have longer strings which I can use in a grep-like operation, picking up all the columns carrying the substring:

ID  S1      S2    ...
A   10      7.5   ...
A   9.5     0     ...
A   0       9.5   ...
B   8       0     ...
B   0       13    ...
C   8.5     0     ...
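
For illustration, this averaging step on its own could look like the sketch below, assuming everything before the first underscore identifies a sample (the names `values`, `prefixes` and `averaged` are just placeholders):

# Keep only the S*_R* value columns, then average those sharing a prefix
values = df.set_index('ID')
prefixes = values.columns.str.split('_').str[0]   # 'S1', 'S1', 'S2', 'S2', ...
averaged = values.T.groupby(prefixes).mean().T    # one averaged column per S* prefix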

I would then like to assign a True or a False to each column depending on whether its value is greater than a constant value (let's say > 0 or not):

ID  S1      S2      ...
A   True    True    ...
A   True    False   ...
A   False   True    ...
B   True    False   ...
B   False   True    ...
C   True    False   ...
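
Continuing the sketch, that thresholding is a single vectorised comparison, with 0 as the constant:

# True wherever the per-sample average exceeds the constant
flags = averaged.gt(0)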

Then I would like to groupby().count() by ID but considering only those samples that have True in the column. The outcome should be this:

ID   S1    S2   ...
A    2     2    ...
B    1     1    ...
C    1     0    ...
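
Because True sums as 1, the count per ID is just a grouped sum over the boolean frame from the previous sketch:

# Count the True flags per ID
result = flags.groupby(level=0).sum()
print(result)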

I am currently doing all these steps with a combination of data frame subsets, merge, join and groupby().count(), but it looks horrible and spaghetti-code-ish, so I am really not a fan of what I've done. Most importantly, I don't feel like I can trust this piece of code on any dataframe passed to the script from the command line; it doesn't seem very reproducible.

Could you help me out a bit? What's the neatest and most pythonic solution you can think of?

You can convert ID to the index, then reduce the column names to their S* prefixes, get the average per S* group, compare for greater than 0 with DataFrame.gt, and finally count the Trues with sum:

# Move ID into the index so only the value columns remain
df = df.set_index('ID')

# Simplified "grep": keep only the part of each column name before '_'
df.columns = df.columns.str.split('_').str[0]

# Average per S* prefix, flag values > 0, then count True per ID
# (the level= argument of mean/sum requires pandas < 2.0)
df = df.mean(axis=1, level=0).gt(0).sum(level=0)
print(df)
    S1  S2
ID        
A    2   2
B    1   1
C    1   0
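
Note that the level= argument of mean/sum used above was deprecated in pandas 1.3 and removed in 2.0, so on a current pandas a roughly equivalent version of the same idea (grouping the columns by their prefix via a transpose), starting again from the original frame, would be:

# Same logic without the removed level= arguments
out = (df.set_index('ID')
         .T.groupby(lambda c: c.split('_')[0]).mean()   # average per S* prefix
         .T.gt(0)                                       # flag averages > 0
         .groupby(level=0).sum())                       # count Trues per ID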
