Pandas Count frequency of values by column

I am attempting to use Python/Pandas to perform several operations on the sample dataset below that I usually do easily in R.

S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
QUER.MAC 9 8 3 5 6 0 5 0 0 0
QUER.VEL 8 9 8 7 0 0 0 0 0 0
CARY.OVA 6 6 2 7 0 2 0 0 0 0
PRUN.SER 3 5 6 6 6 4 5 0 4 1
QUER.ALB 5 4 9 9 7 7 4 6 0 2
JUGL.NIG 2 0 0 0 3 5 6 4 3 0
QUER.RUB 3 4 0 6 9 8 7 6 4 3
JUGL.CIN 0 0 5 0 2 0 0 2 0 2
ULMU.AME 2 2 4 5 6 0 5 0 2 5
TILI.AME 0 0 0 0 2 7 6 6 7 6
ULMU.RUB 4 0 2 2 5 7 8 8 8 7
CARY.COR 0 0 0 0 0 5 6 4 0 3
OSTR.VIR 0 0 0 0 0 0 7 4 6 5
ACER.SAC 0 0 0 0 0 5 4 8 8 9

After reading the data from a text file with

import numpy as np
import pandas as pd
df = pd.read_csv("sample.txt", header=0, index_col=0, delimiter=' ')

I want to: (1) get the frequency of values larger than zero for each column; (2) get the sum of values in each column; (3) find the maximum value in each column.

I managed to obtain (2) using

N = df.apply(lambda x: np.sum(x))  # applies np.sum to each column, giving the column totals

However, I could not figure out how to achieve (1) and (3).

I need generic solutions that do not depend on the column names, because I want to apply these operations to any number of similar matrices (which will, of course, have different labels and different numbers of columns/rows).

Thanks in advance for any hints and suggestions.

Your 1st (the count of values greater than zero in each column):

df.gt(0).sum()

Your 2nd (the column sums):

df.sum()

Your 3rd (the column maxima):

df.max()
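
If it helps, all three results can be collected into a single summary frame. A minimal sketch, assuming df is the DataFrame read above (the labels n_positive, total and maximum are just illustrative names):

import pandas as pd

summary = pd.DataFrame({
    'n_positive': df.gt(0).sum(),  # (1) count of values > 0 in each column
    'total':      df.sum(),        # (2) sum of each column
    'maximum':    df.max(),        # (3) maximum of each column
})
print(summary)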

You can use mask and describe to get a bunch of stats by column.

df.mask(df <= 0).describe().T

Output:

     count      mean       std  min   25%  50%   75%  max
S1     9.0  4.666667  2.549510  2.0  3.00  4.0  6.00  9.0
S2     7.0  5.428571  2.439750  2.0  4.00  5.0  7.00  9.0
S3     8.0  4.875000  2.642374  2.0  2.75  4.5  6.50  9.0
S4     8.0  5.875000  2.031010  2.0  5.00  6.0  7.00  9.0
S5     9.0  5.111111  2.368778  2.0  3.00  6.0  6.00  9.0
S6     9.0  5.555556  1.878238  2.0  5.00  5.0  7.00  8.0
S7    11.0  5.727273  1.272078  4.0  5.00  6.0  6.50  8.0
S8     9.0  5.333333  2.000000  2.0  4.00  6.0  6.00  8.0
S9     8.0  5.250000  2.314550  2.0  3.75  5.0  7.25  8.0
S10   10.0  4.300000  2.540779  1.0  2.25  4.0  5.75  9.0

The reason to use mask is that count counts all non-NaN values, so masking anything that is less than or equal to 0 turns those values into NaN, which count then ignores.
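
To see the effect in isolation, here is a small sketch (again assuming df from above):

masked = df.mask(df <= 0)         # values <= 0 become NaN
positive_counts = masked.count()  # counts only non-NaN values, same result as df.gt(0).sum()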

And, finally, we can add a "sum" column too, using assign:

df.mask(df <= 0).describe().T.assign(sum=df.sum())

Output:

     count      mean       std  min   25%  50%   75%  max  sum
S1     9.0  4.666667  2.549510  2.0  3.00  4.0  6.00  9.0   42
S2     7.0  5.428571  2.439750  2.0  4.00  5.0  7.00  9.0   38
S3     8.0  4.875000  2.642374  2.0  2.75  4.5  6.50  9.0   39
S4     8.0  5.875000  2.031010  2.0  5.00  6.0  7.00  9.0   47
S5     9.0  5.111111  2.368778  2.0  3.00  6.0  6.00  9.0   46
S6     9.0  5.555556  1.878238  2.0  5.00  5.0  7.00  8.0   50
S7    11.0  5.727273  1.272078  4.0  5.00  6.0  6.50  8.0   63
S8     9.0  5.333333  2.000000  2.0  4.00  6.0  6.00  8.0   48
S9     8.0  5.250000  2.314550  2.0  3.75  5.0  7.25  8.0   42
S10   10.0  4.300000  2.540779  1.0  2.25  4.0  5.75  9.0   43
