
Percentile results in R do not match MS Excel

I have the following toy data set (the actual data set is ~500,000 records):

library(data.table)

dt <- data.table(Address = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold", "Copper", "Gold", "Bronze"),
                 Name = c("Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1"), 
                 AvgValue = c(0, 0.5, 1.25, 0.75, 1.5, 0.7, 0.41, 0.83, 2.58),
                 Samples = c(123, 233, 504, 3, 94, 50, 401, 402, 12))

I want to do the following:

a) subset the data so that we only consider "Gold" records AND values in the "AvgValue" column greater than zero

b) with the filtered data from "a" above, print out percentiles and other descriptive stats.

The code to perform "a" and "b" above is as follows:

qs = dt[AvgValue > 0 & Address %like% 'Gold', 
        .(Samples = sum(Samples),
          '25th'    = quantile(AvgValue, probs = c(0.25)),
          '50th'    = quantile(AvgValue, probs = c(0.50)),
          '75th'    = quantile(AvgValue, probs = c(0.75)),
          '95th'    = quantile(AvgValue, probs = c(0.95)),
          '99th'    = quantile(AvgValue, probs = c(0.99)),
          '99.9th'  = quantile(AvgValue, probs = c(0.999)), 
          '99.99th' = quantile(AvgValue, probs = c(0.9999)),
          'Mean'    = mean(AvgValue),
          'Median'  = median(AvgValue),
          'StdDev'  = sd(AvgValue)),
        by = .(Name, Address)]
setkey(qs, 'Name')

Printing qs shows:

Name    Address Samples 25th  50th   75th   95th   99th    99.9th   99.99th   Mean     Median   StdDev
Stat1   Gold    779     0.65  0.765  0.9975 1.3995 1.4799  1.49799  1.499799  0.8825   0.765    0.4334647

So far, so good. These values from the (small) toy data set seem to tie out to the output from the PERCENTILE() function in MS Excel.
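As a quick check of the toy figures, the filtered "Gold" values can be passed to quantile() directly. This is a minimal sketch using base R only; gold_vals is copied by hand from the toy table above. It also shows that leaving out the type argument is the same as passing type = 7, which is R's documented default.

```r
# "Gold" rows with AvgValue > 0, copied from the toy data set
gold_vals <- c(0.5, 1.5, 0.7, 0.83)

probs <- c(0.25, 0.50, 0.75, 0.95)

# quantile() falls back to type = 7 when no type is given
default_q  <- quantile(gold_vals, probs = probs)
explicit_q <- quantile(gold_vals, probs = probs, type = 7)

print(default_q)
print(identical(default_q, explicit_q))
```

The first four percentile columns of qs (0.65, 0.765, 0.9975, 1.3995) should match default_q.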

EDIT: Here's the problem: when I apply this R code to the larger data set, the values output by R do not tie out to the values output by the PERCENTILE() function in Excel. In the lower percentiles, the values are slightly different. In the upper percentiles, the values are significantly different. Here are the differences:

             25th           50th        75th        95th        99th        99.9th      99.99th
    R        0.414442227    0.428557466 0.45030771  1.668065665 42.7787092  146.9633133 349.6416913
    Excel    0.414774203    0.429350073 0.448245768 0.971100779 13.31231723 98.75342572 188.2700879

And here are 20 actual data points (out of a total of 11,283 "Gold" rows). These are sorted descending:

AvgValue
349.1436739
190.189758
175.2157327
158.6492516
132.9550737
132.2686941
126.570912
122.9771829
107.6942185
99.98552912
98.93274272
98.75984129
98.73709105
98.30154271
98.2491005
96.97274385
96.94577839
96.9128099
96.90816688
96.82527478

The values from Excel seem "more correct" (especially the upper percentiles).

Does anyone see anything glaringly wrong with my R code?

If not, any ideas as to why the values from R are not tying out to the values from Excel?

Perhaps the "type" argument of the quantile() function (which I've not passed in)?

Thanks!

I am able to reproduce the Excel PERCENTILE function by setting type=7 in the R quantile function. See the output [[7]] from lapply below and compare it to what you get using Excel's PERCENTILE on my toy vector, testveclog:

set.seed(12272019)
testveclog <- rlnorm(11283, meanlog=-0.12, sdlog=3)
lapply(1:9, function(x) quantile(testveclog, probs=c(0.95, 0.99, 0.999), type=x))

#[[1]]
#      95%       99%     99.9% 
# 131.0835  933.6057 6213.7963 

#[[2]]
#      95%       99%     99.9% 
# 131.0835  933.6057 6213.7963 

#[[3]]
#      95%       99%     99.9% 
# 131.0835  932.8875 6213.7963 

#[[4]]
#      95%       99%     99.9% 
# 131.0141  933.0096 6198.9585 

#[[5]]
#      95%       99%     99.9% 
# 131.1827  933.3687 6230.8209 

#[[6]]
#      95%       99%     99.9% 
# 131.3103  935.1852 6269.9696 

#[[7]]
#      95%       99%     99.9% 
# 131.0372  933.0168 6199.0109 

#[[8]]
#      95%       99%     99.9% 
# 131.2253  933.4860 6243.8705 

#[[9]]
#      95%       99%     99.9% 
# 131.2146  933.4567 6240.6081

writeClipboard(as.character(testveclog)) # Windows only: copy the vector, then paste it into Excel to compare functions
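The nine types differ mainly in where they place the requested percentile among the sorted values. As a rough illustration, the sketch below recomputes types 6 and 7 by hand, assuming the rank formulas given in R's documentation for quantile(): h = (n - 1)p + 1 for type 7 and h = (n + 1)p for type 6, with linear interpolation between the neighbouring order statistics. interp_quantile is a hypothetical helper written for this example, not part of R.

```r
# Hypothetical helper: quantile by linear interpolation between order statistics.
# type 7 uses rank h = (n - 1) * p + 1; any other value of type is treated as
# type 6, which uses h = (n + 1) * p.
interp_quantile <- function(x, p, type = 7) {
  x <- sort(x)
  n <- length(x)
  h <- if (type == 7) (n - 1) * p + 1 else (n + 1) * p
  h <- pmin(pmax(h, 1), n)   # clamp the rank to the valid range [1, n]
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

x <- c(0.5, 0.7, 0.83, 1.5)
p <- c(0.25, 0.50, 0.95)

# Compare the hand calculation with quantile() for both types
print(rbind(
  hand_type7 = interp_quantile(x, p, type = 7),
  r_type7    = unname(quantile(x, p, type = 7)),
  hand_type6 = interp_quantile(x, p, type = 6),
  r_type6    = unname(quantile(x, p, type = 6))
))
```

With only four values the two types already give different 25th and 95th percentiles while agreeing on the median. In the lapply output above, the gap between [[6]] and [[7]] is largest at 99.9%.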


Note that in more current versions of Excel, the PERCENTILE function is kept only for backward compatibility. Its replacements are PERCENTILE.INC, which behaves like the old PERCENTILE, and PERCENTILE.EXC, which matches the output from R's quantile function using type=6.
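One detail in the question's figures is worth checking before attributing the whole gap to the quantile type. None of quantile()'s nine types can return a value above the sample maximum, yet R's 99.99th percentile (349.6416913) is larger than the biggest listed "Gold" value (349.1436739). That suggests, though it does not prove, that R and Excel are summarising different sets of rows. One possible cause is the filter: %like% in data.table is a pattern match, so Address %like% 'Gold' also keeps any address that merely contains "Gold". The sketch below uses made-up data (dt_check and its values are hypothetical, not taken from the real data set) to show how the two filters can diverge.

```r
library(data.table)

# Hypothetical addresses: two of them merely contain the text "Gold"
dt_check <- data.table(
  Address  = c("Gold", "Gold", "Gold Coast", "Marigold", "Silver"),
  AvgValue = c(0.5, 1.5, 400, 350, 1.25)
)

# %like% is a pattern match, so it also keeps "Gold Coast" and "Marigold"
pattern_rows <- dt_check[AvgValue > 0 & Address %like% "Gold"]

# == keeps only rows whose Address is exactly "Gold"
exact_rows <- dt_check[AvgValue > 0 & Address == "Gold"]

# Row count and maximum are quick figures to compare against the spreadsheet
print(pattern_rows[, .(rows = .N, max_value = max(AvgValue))])
print(exact_rows[, .(rows = .N, max_value = max(AvgValue))])
```

If the row count or maximum in R differs from what Excel shows for the same filter, the percentiles cannot be expected to tie out regardless of the type used.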
