
Using describe() with weighted data: mean, standard deviation, median, quantiles

I'm fairly new to Python and pandas (coming from SAS as my workhorse analytical platform), so I apologize in advance if this has already been asked and answered. (I've searched the documentation as well as this site and haven't been able to find an answer yet.)

I've got a dataframe (called resp) containing respondent-level survey data. I want to perform some basic descriptive statistics on one of the fields (called anninc [short for annual income]).

resp["anninc"].describe()

Which gives me the basic stats:

count     76310.000000
mean      43455.874862
std       33154.848314
min           0.000000
25%       20140.000000
50%       34980.000000
75%       56710.000000
max      152884.330000
dtype: float64

But there's a catch. Given how the sample was built, the respondent data had to be weight adjusted so that not every respondent is treated as "equal" when performing the analysis. I have another column in the dataframe (called tufnwgrp) that holds the weight that should be applied to each record during the analysis.

In my prior SAS life, most of the procs have options to process data with weights like this. For example, a standard proc univariate to give the same results would look something like this:

proc univariate data=resp;
  var anninc;
  output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;

And the same analysis using weighted data would look something like this:

proc univariate data=resp;
  var anninc;
  weight tufnwgrp;
  output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;

Is there a similar weighting option available in pandas for methods like describe()?

There is a statistics and econometrics library (statsmodels) that appears to handle this. Here's an example that extends @MSeifert's answer to a similar question.

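import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

# Toy data: values 1..100, each weighted by its own value
df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})

# ddof=1 gives the sample (n - 1) standard deviation
wdf = DescrStatsW(df.x, weights=df.wt, ddof=1)

print(wdf.mean)
print(wdf.std)
print(wdf.quantile([0.25, 0.50, 0.75]))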

67.0
23.6877840059
p
0.25    50
0.50    71
0.75    87
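
For intuition about where the quantiles 50, 71 and 87 come from, here is a minimal sketch of one common weighted-quantile definition: sort the values, accumulate the weights, and take the first value whose cumulative weight reaches the requested share of the total. The helper name weighted_quantile is hypothetical (it is not part of pandas or statsmodels), and DescrStatsW may handle exact ties on a boundary differently, so treat this as an illustration rather than a description of its internals.

```python
import numpy as np

def weighted_quantile(values, weights, probs):
    """Inverse-CDF weighted quantile (hypothetical helper, no tie averaging)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort values and carry the weights along
    order = np.argsort(values, kind="stable")
    values, weights = values[order], weights[order]

    # Cumulative weight up to and including each sorted value
    cum_w = np.cumsum(weights)
    targets = np.asarray(probs, dtype=float) * cum_w[-1]

    # First position whose cumulative weight is >= the target
    idx = np.searchsorted(cum_w, targets, side="left")
    idx = np.minimum(idx, len(values) - 1)  # guard against rounding at p = 1
    return values[idx]

x = np.arange(1, 101)
wt = np.arange(1, 101)
print(weighted_quantile(x, wt, [0.25, 0.50, 0.75]))
```

On the toy data above this reproduces the three quantiles printed by DescrStatsW.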

I don't use SAS, but this gives the same answer as the Stata command:

sum x [fw=wt], detail

Stata actually has a few weight options, and in this case gives a slightly different answer if you specify aw (analytical weights) instead of fw (frequency weights). Also, Stata requires fw to be an integer, whereas DescrStatsW allows non-integer weights. Weights are more complicated than you'd think... This is starting to get into the weeds, but there is a great discussion of weighting issues for calculating the standard deviation here.
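
To make the fw versus aw difference concrete, here is a rough sketch of the two standard deviation formulas as I understand them. It assumes frequency weights count repeated observations (so the denominator is the sum of the weights minus 1), and that analytic weights are rescaled to sum to the number of rows before the usual n - 1 correction. The function name weighted_std and its kind argument are made up for this example, and all weights are assumed to be positive.

```python
import numpy as np

def weighted_std(values, weights, kind="fw"):
    """Weighted sample standard deviation (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)

    mean = np.sum(w * x) / np.sum(w)
    ss = np.sum(w * (x - mean) ** 2)  # weighted sum of squared deviations

    if kind == "fw":
        # Frequency weights: sum(w) plays the role of the sample size
        return np.sqrt(ss / (np.sum(w) - 1))
    if kind == "aw":
        # Analytic weights (assumed definition): rescale so weights sum to n
        n = len(x)
        return np.sqrt(ss / np.sum(w) * n / (n - 1))
    raise ValueError("kind must be 'fw' or 'aw'")

x = np.arange(1, 101)
wt = np.arange(1, 101)
print(weighted_std(x, wt, "fw"))  # about 23.6878, matching DescrStatsW with ddof=1
print(weighted_std(x, wt, "aw"))  # about 23.8048, slightly larger
```

The mean is the same under both schemes; only the denominator of the variance changes.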

Also note that DescrStatsW does not appear to include functions for min and max, but as long as your weights are non-zero this should not be a problem, since the weights don't affect the min and max. However, if you did have some zero weights, it might be nice to have a weighted min and max, and those are easy to calculate in pandas:

df.x[ df.wt > 0 ].min()
df.x[ df.wt > 0 ].max()
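
Putting the pieces together, something like the following could stand in for a weighted describe(). This is only a sketch: weighted_describe is a hypothetical helper, not a pandas or statsmodels function, and dropping zero-weight rows before computing every statistic (not just min and max) is my own assumption about what is wanted.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

def weighted_describe(values, weights):
    """describe()-style summary using weights (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Assumption: rows with zero weight are excluded from every statistic
    keep = weights > 0
    values, weights = values[keep], weights[keep]

    stats = DescrStatsW(values, weights=weights, ddof=1)
    q25, q50, q75 = stats.quantile([0.25, 0.50, 0.75], return_pandas=False)

    return pd.Series({
        "count": float(keep.sum()),           # rows with positive weight
        "sum_wt": float(stats.sum_weights),   # total weight
        "mean": float(stats.mean),
        "std": float(stats.std),
        "min": values.min(),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": values.max(),
    })

# Toy data from above plus one zero-weight outlier that should be ignored
x = list(range(1, 101)) + [1000]
wt = list(range(1, 101)) + [0]
print(weighted_describe(x, wt))
```

For the survey data in the question, the call would be weighted_describe(resp["anninc"], resp["tufnwgrp"]).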

