
How to calculate percentiles in [0, 1) in R such that values lie strictly below the percentile

I have a dataframe of agents and their corresponding number of products sold

Gent_Code   number_policies
A096        3
A0828       12
A0843       2
A0141       2
B079        7
B05         3
M012        5
P010        2
S039        3

I want to calculate the percentile at which each value (xi) lies, such that p% of the values in the data are below xi. The minimum percentile would be 0 and the maximum would be very near to 1 but not 1.

I have done the below:

ag_df <- mutate(ag_df, pon_percentiles = ecdf(ag_df$pon)(ag_df$pon))

summary(ag_df$pon_percentiles )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4805  0.4805  0.6417  0.6356  0.7738  1.0000 

However, I want the percentile formula to calculate below a value and not below or equal to the value.

Hence, the value of percentile for the minimum value in the vector should be 0 and max value should get a percentile close to 1 but not exactly 1.

Current output:
0.6666667 1.0000000 0.3333333 0.3333333 0.8888889 0.6666667 0.7777778 0.3333333 0.6666667

In the output above, the minimum of number_policies (2) gets 0.3333, but I would like it to be 0. The maximum (12) should get 0.99 rather than 1.
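
For reference, the output above can be reproduced from the sample table. This is only a minimal sketch: the vector name x is an assumption (it simply holds the number_policies column). ecdf() counts values less than or equal to each point, which is why the minimum does not get 0.

```r
# number_policies from the sample table (the name `x` is illustrative)
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)

# ecdf(x)(v) is the share of values that are <= v
p_le <- ecdf(x)(x)

# the same quantity written out by hand
p_le_manual <- sapply(x, function(v) mean(x <= v))

print(round(p_le, 7))
```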

How do I do this in R? I have searched for relevant arguments among functions like ecdf, cume_dist etc. but could not find any. Can someone please help me with this?

One solution using the percent_rank() function would be:

library(tidyverse)  # tibble(), mutate(), percent_rank(), %>%
library(stringi)    # stri_rand_strings()


set.seed(2)
n <- 30
db <- tibble(gent_code = paste0(stri_rand_strings(n, 1, '[A-Z]'),
                                stri_rand_strings(n, 4, '[0-9]')),
                 nr_pol = sample(1L:100L, n, TRUE))

db %>%
  mutate(percentile = percent_rank(nr_pol)) %>%
  print(n = n)

which gives the output:

   gent_code nr_pol percentile
   <chr>      <int>      <dbl>
 1 E0188         35     0.241 
 2 S5682         91     0.862 
 3 O6192         96     0.931 
 4 E1197         97     1.000 
 5 Y9358         39     0.345 
 6 Y0069         63     0.552 
 7 D2879         14     0.138 
 8 V6778         25     0.172 
 9 M6284         75     0.759 
10 O3420         69     0.690 
11 O2301         35     0.241 
12 G1728          3     0.0345
13 T4536         38     0.310 
14 E0418          1     0     
15 K9373         44     0.414 
16 W9335         66     0.621 
17 Z4140         58     0.448 
18 F1424         62     0.517 
19 L9825         96     0.931 
20 B8411         59     0.483 
21 R0735         41     0.379 
22 K8881         81     0.793 
23 V9502         87     0.828 
24 D9827          5     0.0690
25 J5363          8     0.103 
26 M2909         68     0.655 
27 D3658         94     0.897 
28 J1312         34     0.207 
29 Z6347         63     0.552 
30 D6342         72     0.724 

As you see it starts at 0 as you want, but the highest percentile will be equal to 1, because it reflects the highest number of policies in your data.

EDIT: Forcing 12 in this case to be equal to, e.g., the 99th percentile implies that you have data points higher than 12 in the data. It will be equal to 1 because all of your data points are less than or equal to this value.
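
If the goal is instead the share of values strictly below each value, one option is to divide the minimum rank minus one by n rather than by n - 1 (percent_rank() rescales by n - 1, which is why its maximum is always 1). A minimal base R sketch on the sample data from the question; the names x, p_below and p_rank are illustrative and not from the original post:

```r
# number_policies from the question (the name `x` is illustrative)
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)
n <- length(x)

# minimum rank: tied values all receive the smallest rank of their group
min_rank <- rank(x, ties.method = "min")

# share of values strictly below each value: 0 for the minimum, < 1 for the maximum
p_below <- (min_rank - 1) / n

# for comparison, the rescaling used by percent_rank() (no missing values assumed)
p_rank <- (min_rank - 1) / (n - 1)

print(round(p_below, 4))
```

Here the minimum (2) gets 0 and the maximum (12) gets 8/9, since eight of the nine values lie below it. With more rows the maximum moves closer to 1 without reaching it.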

You can do this simply with the quantile function, applied to the numeric column rather than to the whole data frame:

quantile(df$number_policies, probs = c(0, 0.24, 0.49, 0.74, 0.99))

Hope that helps!!!
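
Note that quantile() maps probabilities to data values, i.e. it goes in the opposite direction from ecdf(). A small sketch on the question's sample data, assuming the default interpolation (type = 7); the vector name x is illustrative:

```r
# number_policies from the question (the name `x` is illustrative)
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)

# values of x at the requested probabilities (default type = 7 interpolation)
q <- quantile(x, probs = c(0, 0.24, 0.49, 0.74, 0.99))

print(q)
```

The result has one entry per requested probability, not one per agent, so it gives cut points rather than a percentile for each row.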

I think this is what you want, but I'm not sure. You just have to set up the labels and probs the way you would like them.

iris2 <- iris
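# bin each value by the quartile interval it falls into
iris2$quartile_number <- cut(iris$Sepal.Length,
    quantile(iris$Sepal.Length),       # breaks at 0%, 25%, 50%, 75%, 100%
    include.lowest = TRUE,             # keep the minimum in the first bin
    labels = c(.25, .5, .75, 1))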

head(iris2)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species quartile_number
1          5.1         3.5          1.4         0.2  setosa            0.25
2          4.9         3.0          1.4         0.2  setosa            0.25
3          4.7         3.2          1.3         0.2  setosa            0.25
4          4.6         3.1          1.5         0.2  setosa            0.25
5          5.0         3.6          1.4         0.2  setosa            0.25
6          5.4         3.9          1.7         0.4  setosa             0.5

x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)

(1) For the minimum value 2 to be at the 0th percentile, remove the minimum values from the vector. (2) For the maximum value 12 to be at the 99th percentile, append a value larger than the maximum, then pad the vector with the maximum value until its length is 100.

x1 <- c(x[x > min(x)], Inf)
x2 <- c(x1, rep(max(x), 100 - length(x1)))
ecdf(x2)(x)

> ecdf(x2)(x)
[1] 0.03 0.99 0.00 0.00 0.05 0.03 0.04 0.00 0.03
