I am attempting to apply several operations that I usually do easily in R to the sample dataset below, using Python/Pandas.
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
QUER.MAC 9 8 3 5 6 0 5 0 0 0
QUER.VEL 8 9 8 7 0 0 0 0 0 0
CARY.OVA 6 6 2 7 0 2 0 0 0 0
PRUN.SER 3 5 6 6 6 4 5 0 4 1
QUER.ALB 5 4 9 9 7 7 4 6 0 2
JUGL.NIG 2 0 0 0 3 5 6 4 3 0
QUER.RUB 3 4 0 6 9 8 7 6 4 3
JUGL.CIN 0 0 5 0 2 0 0 2 0 2
ULMU.AME 2 2 4 5 6 0 5 0 2 5
TILI.AME 0 0 0 0 2 7 6 6 7 6
ULMU.RUB 4 0 2 2 5 7 8 8 8 7
CARY.COR 0 0 0 0 0 5 6 4 0 3
OSTR.VIR 0 0 0 0 0 0 7 4 6 5
ACER.SAC 0 0 0 0 0 5 4 8 8 9
After reading the data from a text file with
import numpy as np
import pandas as pd
df = pd.read_csv("sample.txt", header=0, index_col=0, delimiter=' ')
I want to: (1) get the frequency of values larger than zero for each column; (2) get the sum of values in each column; (3) find the maximum value in each column.
I managed to obtain (2) using
N = df.apply(lambda x: np.sum(x))
but I could not figure out how to achieve (1) and (3).
I need generic solutions that do not depend on the column names, because I want to apply these operations to any number of similar matrices (which will of course have different labels and different numbers of columns and rows).
Thanks in advance for any hints and suggestions.
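For (1), the number of values greater than zero in each column:
df.gt(0).sum()
For (2), the sum of each column:
df.sum()
For (3), the maximum of each column:
df.max()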
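As a minimal sketch of how these three one-liners can be combined into one table that does not depend on column names, the example below builds a small frame from the first four rows and the first six columns of the sample data. The helper name column_summary and the output labels freq, sum and max are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Subset of the sample data: first four species, columns S1..S6.
df = pd.DataFrame(
    [[9, 8, 3, 5, 6, 0],
     [8, 9, 8, 7, 0, 0],
     [6, 6, 2, 7, 0, 2],
     [3, 5, 6, 6, 6, 4]],
    index=["QUER.MAC", "QUER.VEL", "CARY.OVA", "PRUN.SER"],
    columns=["S1", "S2", "S3", "S4", "S5", "S6"],
)


def column_summary(frame):
    """Hypothetical helper: per-column count of values > 0, sum, and max."""
    return pd.DataFrame({
        "freq": frame.gt(0).sum(),  # True counts as 1, so the sum is a count
        "sum": frame.sum(),
        "max": frame.max(),
    })


summary = column_summary(df)
print(summary)
```

Because the helper only uses column-wise reductions, it should work unchanged on frames with different labels or shapes, as long as every column is numeric.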
You can use mask and describe to get a set of summary statistics for each column:
df.mask(df <= 0).describe().T
Output:
count mean std min 25% 50% 75% max
S1 9.0 4.666667 2.549510 2.0 3.00 4.0 6.00 9.0
S2 7.0 5.428571 2.439750 2.0 4.00 5.0 7.00 9.0
S3 8.0 4.875000 2.642374 2.0 2.75 4.5 6.50 9.0
S4 8.0 5.875000 2.031010 2.0 5.00 6.0 7.00 9.0
S5 9.0 5.111111 2.368778 2.0 3.00 6.0 6.00 9.0
S6 9.0 5.555556 1.878238 2.0 5.00 5.0 7.00 8.0
S7 11.0 5.727273 1.272078 4.0 5.00 6.0 6.50 8.0
S8 9.0 5.333333 2.000000 2.0 4.00 6.0 6.00 8.0
S9 8.0 5.250000 2.314550 2.0 3.75 5.0 7.25 8.0
S10 10.0 4.300000 2.540779 1.0 2.25 4.0 5.75 9.0
The reason to use mask is that count counts all non-NaN values, so masking everything that is less than or equal to 0 turns those values into NaN, and count then skips them.
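To make the effect of the mask concrete, here is a small sketch on two columns taken from the first four rows of the sample data (S5 and S6). The variable names are illustrative only.

```python
import pandas as pd

# Two columns from the first four rows of the sample data.
small = pd.DataFrame({"S5": [6, 0, 0, 6], "S6": [0, 0, 2, 4]})

# Replace every value <= 0 with NaN; other values are kept.
masked = small.mask(small <= 0)

plain_count = small.count()    # counts every non-NaN cell, zeros included
masked_count = masked.count()  # zeros are now NaN, so they are skipped

print(plain_count)
print(masked_count)
```

Note that the mask affects the other statistics too: mean, std, min and the quartiles reported by describe are then computed on the positive values only, which is why min never shows 0 in the output above.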
And finally, we can add the sum too, using assign:
df.mask(df<=0).describe().T.assign(sum=df.sum())
Output:
count mean std min 25% 50% 75% max sum
S1 9.0 4.666667 2.549510 2.0 3.00 4.0 6.00 9.0 42
S2 7.0 5.428571 2.439750 2.0 4.00 5.0 7.00 9.0 38
S3 8.0 4.875000 2.642374 2.0 2.75 4.5 6.50 9.0 39
S4 8.0 5.875000 2.031010 2.0 5.00 6.0 7.00 9.0 47
S5 9.0 5.111111 2.368778 2.0 3.00 6.0 6.00 9.0 46
S6 9.0 5.555556 1.878238 2.0 5.00 5.0 7.00 8.0 50
S7 11.0 5.727273 1.272078 4.0 5.00 6.0 6.50 8.0 63
S8 9.0 5.333333 2.000000 2.0 4.00 6.0 6.00 8.0 48
S9 8.0 5.250000 2.314550 2.0 3.75 5.0 7.25 8.0 42
S10 10.0 4.300000 2.540779 1.0 2.25 4.0 5.75 9.0 43
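If only the three requested statistics are needed, the describe output can be trimmed to them. The following is a sketch on a subset of the sample data (first four rows, columns S4 to S6). Renaming count to freq is an illustrative choice, and the integer cast is applied only to the count column because max may be NaN for a column with no positive values.

```python
import pandas as pd

# Subset of the sample data: first four species, columns S4..S6.
df = pd.DataFrame(
    {"S4": [5, 7, 7, 6], "S5": [6, 0, 0, 6], "S6": [0, 0, 2, 4]},
    index=["QUER.MAC", "QUER.VEL", "CARY.OVA", "PRUN.SER"],
)

trimmed = (
    df.mask(df <= 0)          # values <= 0 become NaN
      .describe()
      .T[["count", "max"]]    # keep only the requested statistics
      .rename(columns={"count": "freq"})
      .assign(sum=df.sum())   # sum taken from the unmasked data
      .astype({"freq": int})  # count is never NaN, so the cast is safe
)
print(trimmed)
```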