简体   繁体   English

计算频率数据帧的众数、中位数和偏度

[英]Calculate Mode, Median and Skewness of frequency dataframe

I have a dataframe like this:我有一个这样的数据框:

Category   Frequency
1             30000
2             45000
3             32400
4             42200
5             56300
6             98200

How do I calculate the mean, median and skewness of the Categories?如何计算类别的均值、中值和偏度?

I have tried the following:我尝试了以下方法:

df['cum_freq'] = [df["Category"]]*df["Frequnecy"]
mean = df['cum_freq'].mean()
median = df['cum_freq'].median()
skew = df['cum_freq'].skew()

If the total frequency is small enough to fit into memory, use repeat to generate the data and then you can easily call those methods.如果总频率足够小以适合内存,请使用repeat生成数据,然后您可以轻松调用这些方法。

s = df['Category'].repeat(df['Frequency']).reset_index(drop=True)

print(s.mean(), s.var(ddof=1), s.skew(), s.kurtosis())
# 4.13252219664584 3.045585008424625 -0.4512924988072343 -1.1923306818513022

Otherwise, you will need more complicated algebra to calculate the moments, which can be done with the k-statistics Some of the lower moments can be done with other libraries like numpy or statsmodels .否则,您将需要更复杂的代数来计算矩,这可以使用k-statistics完成一些较低的矩可以使用numpystatsmodels等其他库来完成。 But for things like skewness and kurtosis this is done manually from the sums of the de-meaned values (calculated from counts).但是对于像偏度和峰度这样的事情,这是通过去平均值的总和(从计数计算)手动完成的。 Since these sums will overflow numpy, we need use normal python.由于这些总和会溢出 numpy,我们需要使用普通的 python。

def moments_from_counts(vals, weights):
    """
    Returns tuple (mean, N-1 variance, skewness, kurtosis) from count data
    """
    vals = [float(x) for x in vals]
    weights = [float(x) for x in weights]

    n = sum(weights)
    mu = sum([x*y for x,y in zip(vals,weights)])/n
    S1 = sum([(x-mu)**1*y for x,y in zip(vals,weights)])
    S2 = sum([(x-mu)**2*y for x,y in zip(vals,weights)])
    S3 = sum([(x-mu)**3*y for x,y in zip(vals,weights)])
    S4 = sum([(x-mu)**4*y for x,y in zip(vals,weights)])

    k1 = S1/n
    k2 = (n*S2-S1**2)/(n*(n-1))
    k3 = (2*S1**3 - 3*n*S1*S2 + n**2*S3)/(n*(n-1)*(n-2))
    k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
    
    return mu, k2, k3/k2**1.5, k4/k2**2

moments_from_counts(df['Category'], df['Frequency'])
#(4.13252219664584, 3.045585008418879, -0.4512924988072345, -1.1923306818513018)

statsmodels has a nice class that takes care of lower moments, as well as the quantiles. statsmodels 有一个很好的类,可以处理较低的时刻以及分位数。

from statsmodels.stats.weightstats import DescrStatsW

d = DescrStatsW(df['Category'], weights=df['Frequency'])

d.mean
#4.13252219664584

d.var_ddof(1)
#3.045585008418879

the DescrStatsW class also gives you access to the underlying data as an array if you call d.asrepeats()如果您调用d.asrepeats() ,则 DescrStatsW 类还允许您以数组形式访问基础数据

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM