Pandas: Count frequency of values by column
I am attempting to apply several operations, which I usually do easily in R, to the sample dataset below using Python/pandas.
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
QUER.MAC 9 8 3 5 6 0 5 0 0 0
QUER.VEL 8 9 8 7 0 0 0 0 0 0
CARY.OVA 6 6 2 7 0 2 0 0 0 0
PRUN.SER 3 5 6 6 6 4 5 0 4 1
QUER.ALB 5 4 9 9 7 7 4 6 0 2
JUGL.NIG 2 0 0 0 3 5 6 4 3 0
QUER.RUB 3 4 0 6 9 8 7 6 4 3
JUGL.CIN 0 0 5 0 2 0 0 2 0 2
ULMU.AME 2 2 4 5 6 0 5 0 2 5
TILI.AME 0 0 0 0 2 7 6 6 7 6
ULMU.RUB 4 0 2 2 5 7 8 8 8 7
CARY.COR 0 0 0 0 0 5 6 4 0 3
OSTR.VIR 0 0 0 0 0 0 7 4 6 5
ACER.SAC 0 0 0 0 0 5 4 8 8 9
After reading the data from a text file with
import numpy as np
import pandas as pd
df = pd.read_csv("sample.txt", header=0, index_col=0, delimiter=' ')
I want to: (1) get the frequency of values larger than zero in each column; (2) get the sum of values in each column; (3) find the maximum value in each column.
I managed to obtain (2) using
N = df.apply(lambda x: np.sum(x))
but I could not figure out how to achieve (1) and (3).
I need generic solutions that are not dependent on the names of the columns, because I want to apply these operations to any number of similar matrices (which, of course, will have different labels and numbers of columns/rows).
Thanks in advance for any hints and suggestions.
Your 1st:
df.gt(0).sum()
Your 2nd:
df.sum()
Your 3rd:
df.max()
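As a quick check, the three one-liners above can be run on a tiny stand-in frame (the values below are a made-up subset of the question's data, not the full table):

```python
import pandas as pd

# Hypothetical miniature version of the question's species-by-site matrix.
df = pd.DataFrame(
    {"S1": [9, 8, 0], "S2": [0, 4, 3]},
    index=["QUER.MAC", "QUER.VEL", "CARY.OVA"],
)

nonzero = df.gt(0).sum()  # (1) count of values > 0 per column
totals = df.sum()         # (2) sum of each column
maxima = df.max()         # (3) maximum of each column

print(nonzero.tolist())  # [2, 2]
print(totals.tolist())   # [17, 7]
print(maxima.tolist())   # [9, 4]
```

All three return a Series indexed by the column labels, so nothing depends on the column names.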
You can use mask and describe to get a bunch of stats by column.
df.mask(df <= 0).describe().T
Output:
count mean std min 25% 50% 75% max
S1 9.0 4.666667 2.549510 2.0 3.00 4.0 6.00 9.0
S2 7.0 5.428571 2.439750 2.0 4.00 5.0 7.00 9.0
S3 8.0 4.875000 2.642374 2.0 2.75 4.5 6.50 9.0
S4 8.0 5.875000 2.031010 2.0 5.00 6.0 7.00 9.0
S5 9.0 5.111111 2.368778 2.0 3.00 6.0 6.00 9.0
S6 9.0 5.555556 1.878238 2.0 5.00 5.0 7.00 8.0
S7 11.0 5.727273 1.272078 4.0 5.00 6.0 6.50 8.0
S8 9.0 5.333333 2.000000 2.0 4.00 6.0 6.00 8.0
S9 8.0 5.250000 2.314550 2.0 3.75 5.0 7.25 8.0
S10 10.0 4.300000 2.540779 1.0 2.25 4.0 5.75 9.0
The reason to use mask is that count counts all non-NaN values, so masking anything that is less than or equal to 0 turns those values into NaN, and count then skips them.
And, finally, we can add "sum" too, using assign:
df.mask(df <= 0).describe().T.assign(sum=df.sum())
Output:
count mean std min 25% 50% 75% max sum
S1 9.0 4.666667 2.549510 2.0 3.00 4.0 6.00 9.0 42
S2 7.0 5.428571 2.439750 2.0 4.00 5.0 7.00 9.0 38
S3 8.0 4.875000 2.642374 2.0 2.75 4.5 6.50 9.0 39
S4 8.0 5.875000 2.031010 2.0 5.00 6.0 7.00 9.0 47
S5 9.0 5.111111 2.368778 2.0 3.00 6.0 6.00 9.0 46
S6 9.0 5.555556 1.878238 2.0 5.00 5.0 7.00 8.0 50
S7 11.0 5.727273 1.272078 4.0 5.00 6.0 6.50 8.0 63
S8 9.0 5.333333 2.000000 2.0 4.00 6.0 6.00 8.0 48
S9 8.0 5.250000 2.314550 2.0 3.75 5.0 7.25 8.0 42
S10 10.0 4.300000 2.540779 1.0 2.25 4.0 5.75 9.0 43
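If only the three requested statistics are needed, they can also be collected into one table without describe (a sketch; the column names "nonzero", "sum", and "max" are my own choice):

```python
import pandas as pd

# Hypothetical small frame standing in for the question's data.
df = pd.DataFrame({"S1": [9, 8, 0], "S2": [0, 4, 3]})

# One row per original column, one statistic per new column.
summary = pd.DataFrame({
    "nonzero": df.gt(0).sum(),  # (1) frequency of values > 0
    "sum": df.sum(),            # (2) column sums
    "max": df.max(),            # (3) column maxima
})
print(summary)
```

Unlike the describe-based table, this keeps integer dtypes, since no NaN is ever introduced.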