How to perform a selective groupby().count() in pandas?
I'm struggling with the implementation of a peculiar combination of pandas `groupby().count()` and column average computation in a script, and since I'm operating on a tight schedule I decided to ask for help here on Stack; hopefully someone will know a very pythonic and pandas-oriented solution that I can add to my baggage. All the ideas I came up with are a bit sloppy and I don't like them.
I have a pandas dataframe with 500+ rows and 80+ columns that looks like this:
ID S1_R1 S1_R2 S2_R1 S2_R2 ...
A 10 10 5 10 ...
A 9 10 0 0 ...
A 0 0 10 9 ...
B 6 10 0 0 ...
B 0 0 15 11 ...
C 5 12 0 0 ...
I would like to first obtain the averages of the `S*` columns. I put "S", but in reality I have longer strings which I can use in a `grep`-like operation, picking up all the columns carrying the substring:
ID S1 S2 ...
A 10 7.5 ...
A 9.5 0 ...
A 0 9.5 ...
B 8 0 ...
B 0 13 ...
C 8.5 0 ...
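This averaging step can be sketched as follows. The data and the `S<n>_R<n>` column pattern are from the question; the grouping key (the prefix before `_`) is an assumption based on the sample column names:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B', 'C'],
    'S1_R1': [10, 9, 0, 6, 0, 5],
    'S1_R2': [10, 10, 0, 10, 0, 12],
    'S2_R1': [5, 0, 10, 0, 15, 0],
    'S2_R2': [10, 0, 9, 0, 11, 0],
})

# group replicate columns that share the prefix before "_" and average them
prefixes = df.columns.drop('ID').str.split('_').str[0]
means = df.drop(columns='ID').T.groupby(prefixes).mean().T
means.insert(0, 'ID', df['ID'])
print(means)
```

The transpose trick turns the column-wise grouping into an ordinary row-wise `groupby`, which works across pandas versions.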
I would then like to assign a `True` or a `False` to each column depending on whether its value is greater than a constant (let's say > 0 or not):
ID S1 S2 ...
A True True ...
A True False ...
A False True ...
B True False ...
B False True ...
C True False ...
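The comparison above is a single vectorized call: `DataFrame.gt` (or plain `>`) returns a boolean frame of the same shape. A minimal sketch, assuming the averaged frame from the previous step with `ID` still a regular column:

```python
import pandas as pd

means = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B', 'C'],
    'S1': [10.0, 9.5, 0.0, 8.0, 0.0, 8.5],
    'S2': [7.5, 0.0, 9.5, 0.0, 13.0, 0.0],
})

threshold = 0  # the constant from the question
flags = means.set_index('ID').gt(threshold)
print(flags)
```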
Then I would like to `groupby().count()` by `ID`, but considering only those samples that have `True` in the column. The outcome should be this:
ID S1 S2 ...
A 2 2 ...
B 1 1 ...
C 1 0 ...
I am currently doing all these steps with a combination of dataframe subsets, merge, join and groupby().count(), but it looks horrible and spaghetti-code-ish, so I am really not a fan of what I've done. Most importantly, I don't feel like I can trust my piece of code on any dataframe that I pass to the script from the command line; it doesn't seem very reproducible.
Could you help me out a bit? What's the neatest and most pythonic solution you can think of?
You can convert `ID` to the index, then collapse the column names to their `S*` prefixes, get the averages per `S*` column, compare by `DataFrame.gt` for greater than `0`, and finally count the `True`s with `sum`:
df = df.set_index('ID')
# simplified "grep": keep only the part of each column name before "_"
df.columns = df.columns.str.split('_').str[0]
# mean(axis=1, level=0) and sum(level=0) were removed in pandas 2.0,
# so group the transposed columns / the index explicitly instead
df = df.T.groupby(level=0).mean().T.gt(0).groupby(level=0).sum()
print(df)
S1 S2
ID
A 2 2
B 1 1
C 1 0
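If you only need one group of columns at a time (the `grep`-like selection mentioned in the question), `DataFrame.filter` can pull columns by substring (`like=`) or regex (`regex=`) before averaging. A small sketch, assuming a substring such as `'S1'`:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B', 'C'],
    'S1_R1': [10, 9, 0, 6, 0, 5],
    'S1_R2': [10, 10, 0, 10, 0, 12],
    'S2_R1': [5, 0, 10, 0, 15, 0],
    'S2_R2': [10, 0, 9, 0, 11, 0],
})

# grep-like column selection: all columns containing "S1"
counts = (df.filter(like='S1')
            .mean(axis=1)        # average the matched replicate columns
            .gt(0)               # True where the average exceeds 0
            .groupby(df['ID'])   # count Trues per ID
            .sum())
print(counts)
```

This also works on any dataframe passed in from the command line, as long as the substring actually matches some columns.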