I'm struggling to implement a peculiar combination of pandas groupby().count() and column averaging in a script. Since I'm on a tight schedule, I decided to ask for help here on Stack; hopefully someone knows a pythonic, pandas-oriented solution that I can add to my toolbox. All the ideas I came up with are a bit sloppy and I don't like them.
I have a pandas dataframe with 500+ rows and 80+ columns that looks like this:
ID S1_R1 S1_R2 S2_R1 S2_R2 ...
A 10 10 5 10 ...
A 9 10 0 0 ...
A 0 0 10 9 ...
B 6 10 0 0 ...
B 0 0 15 11 ...
C 5 12 0 0 ...
I would first like to obtain the row-wise averages of the S* columns, one average per sample prefix. I wrote "S", but in reality the names are longer strings that I can match with a grep-like operation, picking up all the columns that carry the substring:
ID S1 S2 ...
A 10 7.5 ...
A 9.5 0 ...
A 0 9.5 ...
B 8 0 ...
B 0 13 ...
C 8.5 0 ...
I would then like to assign True or False to each cell, depending on whether its value is greater than a constant (let's say > 0) or not:
ID S1 S2 ...
A True True ...
A True False ...
A False True ...
B True False ...
B False True ...
C True False ...
Then I would like to groupby().count() by ID, but counting only those rows that have True in the column. The outcome should be this:
ID S1 S2 ...
A 2 2 ...
B 1 1 ...
C 1 0 ...
I am currently doing all these steps with a combination of data frame subsets, merge, join and groupby().count() but it does look horrible and spaghetti-code-ish so I am really not a fan of what I've done. Most importantly, I don't feel like I can trust my piece of code on any dataframe that I pass to the script from the command line, it doesn't seem very reproducible.
Could you help me out a bit? What's the neatest and most pythonic solution you can think of?
You can convert ID to the index, reduce the column names to their S* prefix so that replicate columns share a name, take the mean per S* prefix, compare with DataFrame.gt to test for values greater than 0, and finally count the True values per ID with sum:
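df = df.set_index('ID')
# simplified "grep": keep only the part of each column name before '_'
df.columns = df.columns.str.split('_').str[0]
# mean(level=...) and sum(level=...) were removed in pandas 2.0,
# so group on the transposed frame (columns) and then on the index (ID)
df = df.T.groupby(level=0).mean().T.gt(0).groupby(level=0).sum()
print(df)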
S1 S2
ID
A 2 2
B 1 1
C 1 0
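For anyone who wants to try this on the sample data from the question, here is a minimal self-contained sketch of the same three steps (average per prefix, compare with a threshold, sum the True values per ID). The function name count_above_threshold and its parameters id_col, sep and threshold are hypothetical names chosen for illustration, and the prefix rule (everything before the first '_') is an assumption that would need adapting to the real, longer column names.

```python
import pandas as pd

# Sample data copied from the question (only the four columns shown there).
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "C"],
    "S1_R1": [10, 9, 0, 6, 0, 5],
    "S1_R2": [10, 10, 0, 10, 0, 12],
    "S2_R1": [5, 0, 10, 0, 15, 0],
    "S2_R2": [10, 0, 9, 0, 11, 0],
})


def count_above_threshold(df, id_col="ID", sep="_", threshold=0):
    """Count, per ID, the rows whose per-prefix average exceeds `threshold`."""
    data = df.drop(columns=id_col)
    # Group label for each column: the part of its name before `sep`
    # (assumed naming rule; swap in your own matching logic if needed).
    prefixes = data.columns.str.split(sep).str[0].to_numpy()
    # Step 1: average the replicate columns that share a prefix.
    means = data.T.groupby(prefixes).mean().T
    # Step 2: True where the average is strictly greater than the threshold.
    flags = means.gt(threshold)
    # Step 3: summing booleans per ID counts the True values.
    return flags.groupby(df[id_col]).sum()


counts = count_above_threshold(df)
print(counts)
```

Keeping ID as a regular column until the last step avoids working with duplicate labels after the transpose, and the threshold parameter makes the "greater than a constant" part explicit rather than hard-coding 0.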