简体   繁体   中英

Percentile results in R do not match MS Excel

I have the following toy data set (the actual data set is ~500,000 records):

library(data.table)

dt <- data.table(Address = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold", "Copper", "Gold", "Bronze"),
                 Name = c("Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1"), 
                 AvgValue = c(0, 0.5, 1.25, 0.75, 1.5, 0.7, 0.41, 0.83, 2.58),
                 Samples = c(123, 233, 504, 3, 94, 50, 401, 402, 12))

I want to do the following:

a) subset the data so that we only consider "Gold" records" AND values in the "Value" column greater than zero

b) with the filtered data in "a" above, print out percentile and other descriptive stats.

The code to perform "a" and "b" above is as follows:

qs = dt[AvgValue > 0 & Address %like% 'Gold', 
        .(Samples = sum(Samples),
          '25th'    = quantile(AvgValue, probs = c(0.25)),
          '50th'    = quantile(AvgValue, probs = c(0.50)),
          '75th'    = quantile(AvgValue, probs = c(0.75)),
          '95th'    = quantile(AvgValue, probs = c(0.95)),
          '99th'    = quantile(AvgValue, probs = c(0.99)),
          '99.9th'  = quantile(AvgValue, probs = c(0.999)), 
          '99.99th' = quantile(AvgValue, probs = c(0.9999)),
          'Mean'    = mean(AvgValue),
          'Median'  = median(AvgValue),
          'StdDev'  = sd(AvgValue)),
        by = .(Name, Address)]
setkey(qs, 'Name')

Printing qs shows:

Name    Address Samples 25th  50th   75th   95th   99th    99.9th   99.99th   Mean     Median   StdDev
Stat1   Gold    779     0.65  0.765  0.9975 1.3995 1.4799  1.49799  1.499799  0.8825   0.765    0.4334647

So far, so good. These values from the (small) toy data set seem to tie out to the output from the PERCENTILE() function in MS Excel.

EDIT: Here's the problem: when I apply this R code to the larger data set, the values output by R do not tie out to the values output by the PERCENTILE() function in Excel. In the lower percentiles, the values are slightly different. In the upper percentiles, the values are significantly different. Here are the differences:

             25th           50th        75th        95th        99th        99.9th      99.99th
    R        0.414442227    0.428557466 0.45030771  1.668065665 42.7787092  146.9633133 349.6416913
    Excel    0.414774203    0.429350073 0.448245768 0.971100779 13.31231723 98.75342572 188.2700879

And here are 20 actual data points (out of a total of 11,283 "Gold" rows). These are sorted descending:

AvgValue
349.1436739
190.189758
175.2157327
158.6492516
132.9550737
132.2686941
126.570912
122.9771829
107.6942185
99.98552912
98.93274272
98.75984129
98.73709105
98.30154271
98.2491005
96.97274385
96.94577839
96.9128099
96.90816688
96.82527478

The values from Excel seem "more correct" (especially the upper percentiles).

Does anyone see anything glaringly wrong with my R code?

If not, any ideas as to why the values from R are not tying out to the values from Excel?

Perhaps the "Type" argument for the Quantile() function (which I've not passed in)?

Thanks!

I am able to reproduce the Excel percentile function by setting the type=7 in the R quantile function. See the output [[7]]] from lapply below and compare to what you get using Excel's percentile on my toy vector, testveclog :

set.seed(12272019)
testveclog <- rlnorm(11283, meanlog=-0.12, sdlog=3)
lapply(1:9, function(x) quantile(testveclog, prob=c(0.95, 0.99, 0.999), type=x))

#[[1]]
#      95%       99%     99.9% 
# 131.0835  933.6057 6213.7963 

#[[2]]
#      95%       99%     99.9% 
# 131.0835  933.6057 6213.7963 

#[[3]]
#      95%       99%     99.9% 
# 131.0835  932.8875 6213.7963 

#[[4]]
#      95%       99%     99.9% 
# 131.0141  933.0096 6198.9585 

#[[5]]
#      95%       99%     99.9% 
# 131.1827  933.3687 6230.8209 

#[[6]]
#      95%       99%     99.9% 
# 131.3103  935.1852 6269.9696 

#[[7]]
#      95%       99%     99.9% 
# 131.0372  933.0168 6199.0109 

#[[8]]
#      95%       99%     99.9% 
# 131.2253  933.4860 6243.8705 

#[[9]]
#      95%       99%     99.9% 
# 131.2146  933.4567 6240.6081

writeClipboard(as.character(testveclog)) #copy and then paste into Excel to compare functions

在此处输入图片说明

Note that in more current versions of Excel, the PERCENTILE function is deprecated in favor of PERCENTILE.EXC , which matches the output from R 's quantile function using type=6

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM