How to perform a selective groupby().count() in pandas?

I'm struggling with the implementation of a peculiar combination of pandas groupby().count() and column averaging in a script, and since I'm operating on a tight schedule I decided to ask for help here on Stack. Hopefully someone will know a very pythonic and pandas-oriented solution that I can add to my toolbox; all the ideas I came up with are a bit sloppy and I don't like them.

I have a pandas dataframe with 500+ rows and 80+ columns that looks like this:

ID  S1_R1   S1_R2   S2_R1   S2_R2  ... 
A   10      10      5       10     ...
A   9       10      0       0      ...
A   0       0       10      9      ...
B   6       10      0       0      ...
B   0       0       15      11     ...
C   5       12      0       0      ...

I would like to first obtain the averages of the S* columns. I wrote "S" here, but in reality the names are longer strings that I can match in a grep-like operation, picking up every column that carries the substring (see the sketch after the table below):

ID  S1      S2    ...
A   10      7.5   ...
A   9.5     0     ...
A   0       9.5   ...
B   8       0     ...
B   0       13    ...
C   8.5     0     ...
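For the record, the grep-like selection itself is not the hard part; something like this already pulls the replicate columns for one prefix (a minimal sketch, with 'S1' standing in for the real, longer substring):

s1_replicates = df.filter(like='S1')    # every column whose name contains 'S1'
s1_mean = s1_replicates.mean(axis=1)    # row-wise average across those replicates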

I would then like to assign True or False to each cell depending on whether its value is greater than some constant (let's say 0); a sketch of what I mean follows the table:

ID  S1      S2      ...
A   True    True    ...
A   True    False   ...
A   False   True    ...
B   True    False   ...
B   False   True    ...
C   True    False   ...
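This step on its own looks like a plain elementwise comparison; assuming avg_df holds the averaged columns from the previous step, something like:

bool_df = avg_df.gt(0)    # elementwise, same as avg_df > 0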

Then I would like to groupby().count() by ID, but counting only the samples that have True in the column (a sketch follows the table). The outcome should be this:

ID   S1    S2   ...
A    2     2    ...
B    1     1    ...
C    1     0    ...
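In isolation this last step also seems to boil down to a grouped boolean sum, since True counts as 1; assuming bool_df still carries ID as a regular column, something like:

counts = bool_df.groupby('ID').sum()    # per-ID count of True cells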

I am currently chaining all these steps together with a combination of dataframe subsets, merge, join and groupby().count(), but it looks horrible and spaghetti-code-ish, so I am really not a fan of what I've done. Most importantly, I don't feel like I can trust my piece of code on an arbitrary dataframe passed to the script from the command line; it doesn't seem very reproducible.

Could you help me out a bit? What's the neatest and most pythonic solution you can think of?

You can convert ID to the index, then shorten the column names so replicate columns share the same S* prefix, take the average per prefix, compare with DataFrame.gt to flag averages greater than 0, and finally count the True values with sum:

df = df.set_index('ID')

# a simple stand-in for the grep step: keep the part of each column name before '_'
df.columns = df.columns.str.split('_').str[0]

# average replicate columns per prefix, flag averages > 0, count Trues per ID
df = df.mean(axis=1, level=0).gt(0).sum(level=0)
print(df)
    S1  S2
ID        
A    2   2
B    1   1
C    1   0
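Note that the level= keyword of mean and sum was deprecated in pandas 1.3 and removed in 2.0, so on a current pandas the chained line above raises a TypeError. A version-proof equivalent (a sketch, starting from the same df with ID as the index and the shortened column names):

out = (
    df.T.groupby(level=0).mean().T    # average the replicate columns per S* prefix
      .gt(0)                          # True where the average exceeds 0
      .groupby(level=0).sum()         # summing booleans counts the Trues per ID
)
print(out)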
