How to calculate a percentile in [0, 1) in R such that values lie strictly below the percentile

I have a data frame of agents and the number of products each has sold:

Gent_Code   number_policies
A096        3
A0828       12
A0843       2
A0141       2
B079        7
B05         3
M012        5
P010        2
S039        3

I want to calculate the percentile at which each value (xi) lies, such that p% of the values in the data are below xi. The minimum percentile would be 0, and the maximum would be very close to 1 but not 1.

I have done the following:

ag_df <- mutate(ag_df, pon_percentiles = ecdf(ag_df$pon)(ag_df$pon))

summary(ag_df$pon_percentiles )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4805  0.4805  0.6417  0.6356  0.7738  1.0000 

However, I want the percentile formula to count the values strictly below a given value, not the values below or equal to it.

Hence, the minimum value in the vector should get a percentile of 0, and the maximum value should get a percentile close to 1 but not exactly 1.

Current output:
0.6666667 1.0000000 0.3333333 0.3333333 0.8888889 0.6666667 0.7777778 0.3333333 0.6666667

In the output above, the minimum of number_policies (2) gets 0.3333, but I would like it to be 0. The maximum, 12, should get not 1 but something like 0.99.
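For reference, the current output can be reproduced directly from the sample table. This is a minimal sketch that assumes the number_policies column is copied into a plain vector named x (a name introduced here only for illustration):

```r
# Sample data: the number_policies column from the table in the question
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)

# ecdf(x)(v) returns the fraction of observations that are <= v,
# which is why the minimum gets 3/9 and the maximum gets exactly 1
current <- ecdf(x)(x)
print(current)
```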

How do I do this in R? I have searched for relevant arguments in functions such as ecdf, cume_dist, etc., but could not find any. Can someone please help me with this?

One solution, using the percent_rank() function, would be:

pkgs <- c("tidyverse", "stringi")
invisible(lapply(pkgs, require, character.only = TRUE))


set.seed(2)
n <- 30
db <- tibble(gent_code = paste0(stri_rand_strings(n, 1, '[A-Z]'),
                                stri_rand_strings(n, 4, '[0-9]')),
                 nr_pol = sample(1L:100L, n, TRUE))

db %>%
  mutate(percentile = percent_rank(nr_pol)) %>%
  print(n = n)

which gives the output:

   gent_code nr_pol percentile
   <chr>      <int>      <dbl>
 1 E0188         35     0.241 
 2 S5682         91     0.862 
 3 O6192         96     0.931 
 4 E1197         97     1.000 
 5 Y9358         39     0.345 
 6 Y0069         63     0.552 
 7 D2879         14     0.138 
 8 V6778         25     0.172 
 9 M6284         75     0.759 
10 O3420         69     0.690 
11 O2301         35     0.241 
12 G1728          3     0.0345
13 T4536         38     0.310 
14 E0418          1     0     
15 K9373         44     0.414 
16 W9335         66     0.621 
17 Z4140         58     0.448 
18 F1424         62     0.517 
19 L9825         96     0.931 
20 B8411         59     0.483 
21 R0735         41     0.379 
22 K8881         81     0.793 
23 V9502         87     0.828 
24 D9827          5     0.0690
25 J5363          8     0.103 
26 M2909         68     0.655 
27 D3658         94     0.897 
28 J1312         34     0.207 
29 Z6347         63     0.552 
30 D6342         72     0.724 

As you can see, it starts at 0 as you want, but the highest percentile will be equal to 1, because it corresponds to the highest number of policies in your data.

EDIT: Forcing 12 in this case to be equal to, e.g., the 99th percentile implies that you have data points higher than 12 in the data. It will be equal to 1 because all of your data points are less than or equal to this value.
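If the goal is literally "the fraction of observations strictly below each value", one possible sketch in base R is to divide the minimum rank minus one by n instead of n - 1. The rescaling (min rank - 1) / (n - 1) is what dplyr documents for percent_rank(); the variable names below are illustrative, and the data is the sample from the question:

```r
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)
n <- length(x)

# Minimum rank: tied values all share the lowest rank in their group
r <- rank(x, ties.method = "min")

# Same rescaling as percent_rank(): the maximum always maps to 1
pr <- (r - 1) / (n - 1)

# Fraction strictly below each value: minimum maps to 0, maximum stays below 1
below <- (r - 1) / n

# A more literal (but slower) way to compute the same thing, as a cross-check
below_check <- vapply(x, function(v) mean(x < v), numeric(1))
```

With this definition the maximum gets (n - 1) / n at most, so it approaches 1 only as the number of observations grows.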

You can simply do this with the quantile function:

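# quantile() needs a numeric vector, so pass the column rather than the whole data frame
quantile(df$number_policies, probs = c(0, 0.24, 0.49, 0.74, 0.99))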

Hope that helps!
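Note that quantile() goes from probabilities to data values, which is the reverse of the mapping asked for in the question. A small sketch on the question's sample data (the vector name x and the helper variables are assumptions for illustration) shows the cut-off values it returns and how each could be mapped back to a fraction of observations strictly below it:

```r
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)

# Data values at the requested probabilities (default interpolation, type 7)
cutoffs <- quantile(x, probs = c(0, 0.24, 0.49, 0.74, 0.99), names = FALSE)

# Going back the other way: fraction of observations strictly below each cut-off
frac_below <- vapply(cutoffs, function(q) mean(x < q), numeric(1))
```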

I think this is what you want, but I'm not sure; you just have to set up the labels and probs the way you would like them.

iris2 <- iris
iris2$quartile_number <- cut(iris$Sepal.Length,
    quantile(iris$Sepal.Length),
    include.lowest = TRUE,
    labels = c(.25, .5, .75, 1))

head(iris2)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species quartile_number
1          5.1         3.5          1.4         0.2  setosa            0.25
2          4.9         3.0          1.4         0.2  setosa            0.25
3          4.7         3.2          1.3         0.2  setosa            0.25
4          4.6         3.1          1.5         0.2  setosa            0.25
5          5.0         3.6          1.4         0.2  setosa            0.25
6          5.4         3.9          1.7         0.4  setosa             0.5
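One caveat when applying this approach to the question's data: with many tied values, several quantiles can coincide, and cut() stops with an error when its breaks are not unique. A hedged sketch of one workaround is to deduplicate the breaks first, at the cost of ending up with fewer bins than quartiles (the names breaks and bins are illustrative):

```r
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)

# quantile(x) returns 2 for both the 0% and 25% points here,
# so the duplicated break has to be dropped before calling cut()
breaks <- unique(quantile(x))

# Default interval labels are kept because the number of bins is no longer fixed
bins <- cut(x, breaks, include.lowest = TRUE)
table(bins)
```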

x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)

(1) For the minimum value 2 to be the 0th percentile, remove the minimum value from the vector.
(2) For the maximum value 12 to be the 99th percentile, append a value larger than the maximum and pad the vector with copies of the maximum until its length is 100.

x1 <- c(x[x > min(x)], Inf)
x2 <- c(x1, rep(max(x), 100 - length(x1)))
ecdf(x2)(x)

> ecdf(x2)(x)
[1] 0.03 0.99 0.00 0.00 0.05 0.03 0.04 0.00 0.03
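The two steps above can be wrapped in a small helper. This is only a sketch: padded_ecdf and its size argument are hypothetical names introduced here, and the resulting numbers depend on the padding size rather than being fractions of the real data.

```r
# Hypothetical helper reproducing the padding trick above
padded_ecdf <- function(x, size = 100) {
  # Drop the minimum value(s) and append a value above every observation
  x1 <- c(x[x > min(x)], Inf)
  # The padding only works if there is room left to pad
  stopifnot(length(x1) <= size)
  # Fill up to `size` elements with copies of the maximum
  x2 <- c(x1, rep(max(x), size - length(x1)))
  ecdf(x2)(x)
}

x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)
res <- padded_ecdf(x)
print(res)
```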
