Percentile results in R do not match MS Excel

Question

I have the following toy data set (the actual data set has about 500,000 records):

library(data.table)

dt <- data.table(Address = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold", "Copper", "Gold", "Bronze"),
                 Name = c("Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1", "Stat1"), 
                 AvgValue = c(0, 0.5, 1.25, 0.75, 1.5, 0.7, 0.41, 0.83, 2.58),
                 Samples = c(123, 233, 504, 3, 94, 50, 401, 402, 12))

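I would like to do the following:

a) Subset the data so that we only consider "Gold" records and values greater than zero in the AvgValue column

b) Using the filtered data from "a" above, print out percentiles and other descriptive statistics.

The code that performs "a" and "b" is as follows: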

qs = dt[AvgValue > 0 & Address %like% 'Gold', 
        .(Samples = sum(Samples),
          '25th'    = quantile(AvgValue, probs = c(0.25)),
          '50th'    = quantile(AvgValue, probs = c(0.50)),
          '75th'    = quantile(AvgValue, probs = c(0.75)),
          '95th'    = quantile(AvgValue, probs = c(0.95)),
          '99th'    = quantile(AvgValue, probs = c(0.99)),
          '99.9th'  = quantile(AvgValue, probs = c(0.999)), 
          '99.99th' = quantile(AvgValue, probs = c(0.9999)),
          'Mean'    = mean(AvgValue),
          'Median'  = median(AvgValue),
          'StdDev'  = sd(AvgValue)),
        by = .(Name, Address)]
setkey(qs, 'Name')

Printing qs shows:

Name    Address Samples 25th  50th   75th   95th   99th    99.9th   99.99th   Mean     Median   StdDev
Stat1   Gold    779     0.65  0.765  0.9975 1.3995 1.4799  1.49799  1.499799  0.8825   0.765    0.4334647

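So far so good. These values from the (small) toy data set appear to match the output of the PERCENTILE() function in MS Excel.

Edit: here is the problem. When I apply this R code to a much larger data set, the values R outputs do not match the values from the PERCENTILE() function in Excel. In the lower percentiles the values differ slightly. In the upper percentiles the values differ significantly. Here are the differences: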

             25th           50th        75th        95th        99th        99.9th      99.99th
    R        0.414442227    0.428557466 0.45030771  1.668065665 42.7787092  146.9633133 349.6416913
    Excel    0.414774203    0.429350073 0.448245768 0.971100779 13.31231723 98.75342572 188.2700879

Here are 20 of the actual data points (out of 11,283 "Gold" rows in total). They are sorted in descending order:

AvgValue
349.1436739
190.189758
175.2157327
158.6492516
132.9550737
132.2686941
126.570912
122.9771829
107.6942185
99.98552912
98.93274272
98.75984129
98.73709105
98.30154271
98.2491005
96.97274385
96.94577839
96.9128099
96.90816688
96.82527478

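The values in Excel seem "more correct" (especially in the upper percentiles).

Does anyone see anything obviously wrong with my R code?

If not, any ideas as to why the values in R do not match the values in Excel?

Perhaps it is the "type" argument of the quantile() function (which I am not passing in)?

Thanks!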

Answer 1

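I can reproduce Excel's PERCENTILE function by setting type=7 in R's quantile function (type=7 is also the default when no type is passed). Look at element [[7]] of the lapply output below and compare it with the result of Excel's PERCENTILE on my toy vector testveclog: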

set.seed(12272019)
testveclog <- rlnorm(11283, meanlog=-0.12, sdlog=3)
lapply(1:9, function(x) quantile(testveclog, probs=c(0.95, 0.99, 0.999), type=x))

#[[1]]
#      95%       99%     99.9% 
# 131.0835  933.6057 6213.7963 

#[[2]]
#      95%       99%     99.9% 
# 131.0835  933.6057 6213.7963 

#[[3]]
#      95%       99%     99.9% 
# 131.0835  932.8875 6213.7963 

#[[4]]
#      95%       99%     99.9% 
# 131.0141  933.0096 6198.9585 

#[[5]]
#      95%       99%     99.9% 
# 131.1827  933.3687 6230.8209 

#[[6]]
#      95%       99%     99.9% 
# 131.3103  935.1852 6269.9696 

#[[7]]
#      95%       99%     99.9% 
# 131.0372  933.0168 6199.0109 

#[[8]]
#      95%       99%     99.9% 
# 131.2253  933.4860 6243.8705 

#[[9]]
#      95%       99%     99.9% 
# 131.2146  933.4567 6240.6081

writeClipboard(as.character(testveclog)) # Windows only: copies the vector to the clipboard; paste into Excel to compare functions
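
For intuition about what type=7 does, the rule can be reproduced by hand. The following is a minimal sketch and not part of the original answer; the helper name quantile_type7 is hypothetical. It assumes the usual definition of type 7: sort the n values, take the fractional position h = (n - 1) * p + 1, and interpolate linearly between the two neighbouring order statistics. It is checked against the four "Gold" values greater than zero from the question's toy data.

```r
# Hypothetical helper (illustration only): type-7 quantile computed by hand.
quantile_type7 <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * p + 1          # fractional position in the sorted vector
  lo <- floor(h)
  hi <- pmin(lo + 1, n)         # clamp so p = 1 does not index past the end
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

# The "Gold" values greater than zero from the question's toy data set
gold  <- c(0.5, 1.5, 0.7, 0.83)
probs <- c(0.25, 0.50, 0.75)

manual  <- quantile_type7(gold, probs)
builtin <- quantile(gold, probs = probs, type = 7, names = FALSE)

print(manual)                     # 0.6500 0.7650 0.9975, as in the question's qs output
print(all.equal(manual, builtin)) # TRUE if the hand-written rule agrees with quantile()
```

The same three numbers appear in the 25th, 50th and 75th columns of the question's toy output, which is consistent with quantile() using type=7 when no type is given.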

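Note that in recent versions of Excel the PERCENTILE function is kept only for backward compatibility. It has been superseded by PERCENTILE.INC, which behaves like the old PERCENTILE (type=7 in R), and PERCENTILE.EXC, which matches the output of R's quantile function with type=6.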
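
The difference between the two conventions is easiest to see on a tiny vector. The sketch below reuses the four "Gold" values from the question's toy data. It assumes that type=7 corresponds to Excel's PERCENTILE (and PERCENTILE.INC) and type=6 to PERCENTILE.EXC; the Excel side is not run here, so only the R results are shown.

```r
# The "Gold" values greater than zero from the question's toy data set
gold  <- c(0.5, 1.5, 0.7, 0.83)
probs <- c(0.25, 0.50, 0.75)

# type = 7 (R's default): position (n - 1) * p + 1 in the sorted vector
q7 <- quantile(gold, probs = probs, type = 7)

# type = 6: position (n + 1) * p in the sorted vector
q6 <- quantile(gold, probs = probs, type = 6)

# Side-by-side comparison, one row per type
print(rbind(type7 = q7, type6 = q6))
```

With only four points the 25th and 75th percentiles differ noticeably between the two types (0.65 vs 0.55 and 0.9975 vs 1.3325), while the median is identical. With thousands of rows the two types usually land much closer together, so a mismatch in type alone may not explain a large gap in the upper percentiles.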

Answer 1 (score 4, accepted, 2019-12-28 05:34:48)