Pandas: Count frequency of values by column

I am attempting to apply, using Python/pandas, several operations that I usually do easily in R to the sample dataset below.

S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
QUER.MAC 9 8 3 5 6 0 5 0 0 0
QUER.VEL 8 9 8 7 0 0 0 0 0 0
CARY.OVA 6 6 2 7 0 2 0 0 0 0
PRUN.SER 3 5 6 6 6 4 5 0 4 1
QUER.ALB 5 4 9 9 7 7 4 6 0 2
JUGL.NIG 2 0 0 0 3 5 6 4 3 0
QUER.RUB 3 4 0 6 9 8 7 6 4 3
JUGL.CIN 0 0 5 0 2 0 0 2 0 2
ULMU.AME 2 2 4 5 6 0 5 0 2 5
TILI.AME 0 0 0 0 2 7 6 6 7 6
ULMU.RUB 4 0 2 2 5 7 8 8 8 7
CARY.COR 0 0 0 0 0 5 6 4 0 3
OSTR.VIR 0 0 0 0 0 0 7 4 6 5
ACER.SAC 0 0 0 0 0 5 4 8 8 9

After reading the data from a text file with

import numpy as np
import pandas as pd
df = pd.read_csv("sample.txt", header=0, index_col=0, delimiter=' ')

I want to: (1) get the frequency of values larger than zero for each column; (2) get the sum of values in each column; (3) find the maximum value in each column.

I managed to obtain (2) using

N = df.apply(lambda x: np.sum(x))

But I could not figure out how to achieve (1) and (3).

I need generic solutions that are not dependent on the names of the columns, because I want to apply these operations to any number of similar matrices (which of course will have different labels and numbers of columns/rows).

Thanks in advance for any hints and suggestions.

Your 1st:

df.gt(0).sum()

2nd:

df.sum()

3rd:

df.max()
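On a small toy frame (a made-up miniature of the species-by-site matrix above, not the full dataset), the three one-liners behave as follows:

```python
import pandas as pd

# Miniature stand-in for the species-by-site matrix in the question
df = pd.DataFrame({"S1": [9, 0, 3], "S2": [8, 9, 0]},
                  index=["QUER.MAC", "QUER.VEL", "PRUN.SER"])

freq = df.gt(0).sum()   # (1) count of values > 0 per column
total = df.sum()        # (2) column sums
peak = df.max()         # (3) column maxima

print(freq.tolist())    # [2, 2]
print(total.tolist())   # [12, 17]
print(peak.tolist())    # [9, 9]
```

All three return a Series indexed by column name, so they work unchanged for any number of columns or rows.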

You can use mask and describe to get a bunch of stats by column.

df.mask(df <= 0).describe().T

Output:

     count      mean       std  min   25%  50%   75%  max
S1     9.0  4.666667  2.549510  2.0  3.00  4.0  6.00  9.0
S2     7.0  5.428571  2.439750  2.0  4.00  5.0  7.00  9.0
S3     8.0  4.875000  2.642374  2.0  2.75  4.5  6.50  9.0
S4     8.0  5.875000  2.031010  2.0  5.00  6.0  7.00  9.0
S5     9.0  5.111111  2.368778  2.0  3.00  6.0  6.00  9.0
S6     9.0  5.555556  1.878238  2.0  5.00  5.0  7.00  8.0
S7    11.0  5.727273  1.272078  4.0  5.00  6.0  6.50  8.0
S8     9.0  5.333333  2.000000  2.0  4.00  6.0  6.00  8.0
S9     8.0  5.250000  2.314550  2.0  3.75  5.0  7.25  8.0
S10   10.0  4.300000  2.540779  1.0  2.25  4.0  5.75  9.0

The reason to use mask is that count counts all non-NaN values, so masking anything that is <= 0 turns those values into NaN, which count then excludes.
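To make the mask-then-count step concrete, here it is on the same made-up miniature frame (zeros stand for "species absent" and should not be counted):

```python
import pandas as pd

# Miniature stand-in for the species-by-site matrix in the question
df = pd.DataFrame({"S1": [9, 0, 3], "S2": [8, 9, 0]},
                  index=["QUER.MAC", "QUER.VEL", "PRUN.SER"])

masked = df.mask(df <= 0)       # zeros (and negatives) become NaN
print(masked["S1"].tolist())    # [9.0, nan, 3.0]
print(masked.count().tolist())  # [2, 2] -- count skips NaN
```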

And, finally, we can add "sum" too, using assign:

df.mask(df<=0).describe().T.assign(sum=df.sum())

Output:

     count      mean       std  min   25%  50%   75%  max  sum
S1     9.0  4.666667  2.549510  2.0  3.00  4.0  6.00  9.0   42
S2     7.0  5.428571  2.439750  2.0  4.00  5.0  7.00  9.0   38
S3     8.0  4.875000  2.642374  2.0  2.75  4.5  6.50  9.0   39
S4     8.0  5.875000  2.031010  2.0  5.00  6.0  7.00  9.0   47
S5     9.0  5.111111  2.368778  2.0  3.00  6.0  6.00  9.0   46
S6     9.0  5.555556  1.878238  2.0  5.00  5.0  7.00  8.0   50
S7    11.0  5.727273  1.272078  4.0  5.00  6.0  6.50  8.0   63
S8     9.0  5.333333  2.000000  2.0  4.00  6.0  6.00  8.0   48
S9     8.0  5.250000  2.314550  2.0  3.75  5.0  7.25  8.0   42
S10   10.0  4.300000  2.540779  1.0  2.25  4.0  5.75  9.0   43
