How to perform a selective groupby().count() in pandas?
I'm struggling with the implementation of a peculiar combination of pandas `groupby().count()` and column average computation in a script, and since I'm operating on a tight schedule I decided to ask for help here on Stack; hopefully someone will know a very pythonic and pandas-oriented solution that I can add to my baggage. All the ideas I came up with are a bit sloppy and I don't like them.
I have a pandas dataframe with 500+ rows and 80+ columns that looks like this:
ID S1_R1 S1_R2 S2_R1 S2_R2 ...
A 10 10 5 10 ...
A 9 10 0 0 ...
A 0 0 10 9 ...
B 6 10 0 0 ...
B 0 0 15 11 ...
C 5 12 0 0 ...
I would like to first obtain the averages of the `S*` columns. I put "S", but in reality I have longer strings which I can use in a `grep`-like operation, picking up all the columns carrying the substring:
ID S1 S2 ...
A 10 7.5 ...
A 9.5 0 ...
A 0 9.5 ...
B 8 0 ...
B 0 13 ...
C 8.5 0 ...
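This averaging step can be sketched as follows. The data and the `S<n>_R<n>` column pattern are from the question; the grouping key (the prefix before `_`) is an assumption based on the sample column names:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B', 'C'],
    'S1_R1': [10, 9, 0, 6, 0, 5],
    'S1_R2': [10, 10, 0, 10, 0, 12],
    'S2_R1': [5, 0, 10, 0, 15, 0],
    'S2_R2': [10, 0, 9, 0, 11, 0],
})

# group replicate columns that share the prefix before "_" and average them
prefixes = df.columns.drop('ID').str.split('_').str[0]
means = df.drop(columns='ID').T.groupby(prefixes).mean().T
means.insert(0, 'ID', df['ID'])
print(means)
```

The transpose trick turns the column-wise grouping into an ordinary row-wise `groupby`, which works across pandas versions.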
I would then like to assign a `True` or a `False` to each column depending on whether its value is greater than a constant (let's say > 0 or not):
ID S1 S2 ...
A True True ...
A True False ...
A False True ...
B True False ...
B False True ...
C True False ...
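The comparison above is a single vectorized call: `DataFrame.gt` (or plain `>`) returns a boolean frame of the same shape. A minimal sketch, assuming the averaged frame from the previous step with `ID` still a regular column:

```python
import pandas as pd

means = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B', 'C'],
    'S1': [10.0, 9.5, 0.0, 8.0, 0.0, 8.5],
    'S2': [7.5, 0.0, 9.5, 0.0, 13.0, 0.0],
})

threshold = 0  # the constant from the question
flags = means.set_index('ID').gt(threshold)
print(flags)
```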
Then I would like to `groupby().count()` by `ID`, but considering only those samples that have `True` in the column. The outcome should be this:
ID S1 S2 ...
A 2 2 ...
B 1 1 ...
C 1 0 ...
I am currently doing all these steps with a combination of dataframe subsets, merge, join and groupby().count(), but it looks horrible and spaghetti-code-ish, so I am really not a fan of what I've done. Most importantly, I don't feel like I can trust my piece of code on any dataframe that I pass to the script from the command line; it doesn't seem very reproducible.
Could you help me out a bit? What's the neatest and most pythonic solution you can think of?
You can convert `ID` to the index, then collapse the column names to their `S*` prefixes, get the averages per `S*` column, compare by `DataFrame.gt` for greater than `0`, and finally count the `True`s with `sum`:
df = df.set_index('ID')
# simplified "grep": keep only the part of each column name before "_"
df.columns = df.columns.str.split('_').str[0]
# mean(axis=1, level=0) and sum(level=0) were removed in pandas 2.0,
# so group the transposed columns / the index explicitly instead
df = df.T.groupby(level=0).mean().T.gt(0).groupby(level=0).sum()
print(df)
S1 S2
ID
A 2 2
B 1 1
C 1 0
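If you only need one group of columns at a time (the `grep`-like selection mentioned in the question), `DataFrame.filter` can pull columns by substring (`like=`) or regex (`regex=`) before averaging. A small sketch, assuming a substring such as `'S1'`:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B', 'C'],
    'S1_R1': [10, 9, 0, 6, 0, 5],
    'S1_R2': [10, 10, 0, 10, 0, 12],
    'S2_R1': [5, 0, 10, 0, 15, 0],
    'S2_R2': [10, 0, 9, 0, 11, 0],
})

# grep-like column selection: all columns containing "S1"
counts = (df.filter(like='S1')
            .mean(axis=1)        # average the matched replicate columns
            .gt(0)               # True where the average exceeds 0
            .groupby(df['ID'])   # count Trues per ID
            .sum())
print(counts)
```

This also works on any dataframe passed in from the command line, as long as the substring actually matches some columns.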