How to rank based on ecdf in R?

This code creates a figure of ECDFs in which the curve for number 5 represents the truth.

   library(data.table)
   library(ggplot2)
   set.seed(123)
   dat_data <- data.table(meanval = rnorm(10),
                   sdval = runif(10, 0.5, 3),
                   rep = sample.int(1000, 10))

   # one normal sample per row of dat_data: `rep` draws using that row's mean and sd
   dat <- rbindlist(lapply(seq_len(nrow(dat_data)), function(x)
     data.table(rowval = x,
                dist = rnorm(dat_data[x, rep], dat_data[x, meanval], dat_data[x, sdval]))))

   ggplot(dat, aes(x = dist, group = factor(rowval), color = factor(rowval))) +
     stat_ecdf(size = 2)

Based on the ECDF output, I would like to rank the curves (numbered 1 to 10) from closest to curve 5 to furthest from curve 5.

Here's a thought for ranking them.

First, for reference, I plotted it again but added a linetype aesthetic so I could more easily see curve 5:

library(ggplot2)
ggplot(dat, aes(x = dist, group = factor(rowval), color = factor(rowval), linetype = rowval != 5)) +
  stat_ecdf(size = 2) +
  scale_linetype_discrete(guide = "none")  # guide = FALSE is deprecated in recent ggplot2

[Figure: 10 empirical cumulative distribution curves, for comparison]

Using data.table, I'll measure the difference at 51 points (every 0.02) along their quantiles:

library(data.table)
quants <- seq(0, 1, length.out = 51)
datquants <- dat[, .(quant = quants, val = ecdf(dist)(quantile(dist, quants))), by = rowval]
datquants
#      rowval quant         val
#       <int> <num>       <num>
#   1:      1  0.00 0.001540832
#   2:      1  0.02 0.020030817
#   3:      1  0.04 0.040061633
#   4:      1  0.06 0.060092450
#   5:      1  0.08 0.080123267
#   6:      1  0.10 0.100154083
#   7:      1  0.12 0.120184900
#   8:      1  0.14 0.140215716
#   9:      1  0.16 0.160246533
#  10:      1  0.18 0.180277350
#  ---                         
# 501:     10  0.82 0.819905213
# 502:     10  0.84 0.840047393
# 503:     10  0.86 0.859004739
# 504:     10  0.88 0.879146919
# 505:     10  0.90 0.899289100
# 506:     10  0.92 0.919431280
# 507:     10  0.94 0.939573460
# 508:     10  0.96 0.959715640
# 509:     10  0.98 0.979857820
# 510:     10  1.00 1.000000000

(Note: a previous version of this answer did not use ecdf, which would give wrong results if the range within each rowval were not the same. Using ecdf, all of our area calculations are in the same domain.)
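
To see what the val column holds, here is a small base-R check on made-up data (the sample `x` and its parameters are illustrative assumptions, not the question's `dat`). It evaluates an ECDF at the sample's own quantiles, which is what the data.table call above does within each rowval:

```r
# Illustrative sample only; any continuous sample without ties behaves the same way
set.seed(1)
x <- rnorm(500, mean = 2, sd = 3)

probs <- seq(0, 1, length.out = 51)   # same 51-point grid as `quants` above
Fx    <- ecdf(x)                      # step function built from the sample
vals  <- Fx(quantile(x, probs))       # ECDF evaluated at the sample's own quantiles

# largest deviation between the returned values and the requested probabilities
max_gap <- max(abs(vals - probs))
max_gap
```

With the default quantile type and no ties, the deviation stays within about 1/n of the requested probability. That is consistent with the output above, where val tracks quant closely in every group.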

From here, we separate out the quantiles for rowval 5, join them back in on quant, take the squared difference, then sum it within each rowval.

datquants[rowval == 5, .(quant, val5 = val)
  ][datquants, on = .(quant)
  ][, val := abs(val - val5)^2
  ][, .(area = 1e6*sum(val)), by = rowval
  ][, rank := rank(area) ]
#     rowval      area  rank
#      <int>     <num> <num>
#  1:      1  22.45959     7
#  2:      2  18.93004     4
#  3:      3 160.48164    10
#  4:      4  17.66167     2
#  5:      5   0.00000     1
#  6:      6  21.52974     6
#  7:      7  24.35520     8
#  8:      8  18.48263     3
#  9:      9  59.90059     9
# 10:     10  19.78913     5

I think the sum of squared differences is a good measure, though I'm not sure it is the best. (The 1e6* is merely there to bring the numbers out of exponential notation for easy visual comparison.)

Disclaimer: this is one method, perhaps just a heuristic, since I'm not certain it's the only or best way.
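
Since the measure is open to debate, here is one alternative sketch in base R. It is not the method used above. It evaluates every group's ECDF on a shared grid of x values and compares it with the reference group's curve, using two distances: the largest vertical gap (a Kolmogorov-Smirnov-style distance, approximated on the grid) and the integrated squared difference. The helper name `ecdf_distance`, its arguments, and the toy data are all hypothetical and exist only for illustration:

```r
# Hypothetical helper: distance of each group's ECDF from a reference group's
# ECDF, with all curves evaluated on one common grid of x values.
ecdf_distance <- function(values, groups, ref, n_grid = 512) {
  grid <- seq(min(values), max(values), length.out = n_grid)
  step <- grid[2] - grid[1]
  ref_curve <- ecdf(values[groups == ref])(grid)
  ids <- sort(unique(groups))
  out <- t(sapply(ids, function(g) {
    d <- ecdf(values[groups == g])(grid) - ref_curve
    c(ks = max(abs(d)),            # largest vertical gap on the grid
      int_sq = sum(d^2) * step)    # Riemann sum of the squared gap
  }))
  res <- data.frame(group = ids, out)
  res$rank_ks <- rank(res$ks)      # rank 1 is the reference itself
  res
}

# Toy data (assumed for illustration): group 2 is close to group 1, group 3 is far
set.seed(42)
toy <- data.frame(
  group = rep(1:3, each = 200),
  value = c(rnorm(200, 0, 1), rnorm(200, 0.2, 1), rnorm(200, 3, 1))
)
res <- ecdf_distance(toy$value, toy$group, ref = 1)
res[order(res$rank_ks), ]
```

For the data in the question, the equivalent call would be `ecdf_distance(dat$dist, dat$rowval, ref = 5)`. Whether the maximum gap or the integrated squared gap suits the problem better depends on whether a single large deviation or an overall mismatch should count for more.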
