How to calculate percentiles in [0, 1) in R such that values lie strictly below the percentile
I have a data frame of agents and the number of products each has sold:
Gent_Code number_policies
A096 3
A0828 12
A0843 2
A0141 2
B079 7
B05 3
M012 5
P010 2
S039 3
I want to calculate the percentile at which each value (xi) lies, such that p% of the values in the data are strictly below xi. The minimum percentile would be 0 and the maximum would be very close to 1, but not exactly 1.
This is what I have tried:
ag_df <- mutate(ag_df, pon_percentiles = ecdf(ag_df$pon)(ag_df$pon))
summary(ag_df$pon_percentiles )
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4805 0.4805 0.6417 0.6356 0.7738 1.0000
However, I want the percentile formula to count the values below a value, not the values below or equal to it.
Hence, the percentile for the minimum value in the vector should be 0, and the maximum value should get a percentile close to 1 but not exactly 1.
Current output:
0.6666667 1.0000000 0.3333333 0.3333333 0.8888889 0.6666667 0.7777778 0.3333333 0.6666667
Looking at the output above, the value for the minimum of number_policies (2) is 0.3333, but I would like it to be 0. For the maximum, which is 12, it should not be 1 but 0.99.
How do I do this in R? I have searched for relevant arguments in functions such as ecdf, cume_dist, etc., but could not find any. Can someone please help me with this?
One solution, using the percent_rank() function from dplyr, would be:
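# Load the packages (tidyverse provides tibble, dplyr and %>%)
pkgs <- c("tidyverse", "stringi")
invisible(lapply(pkgs, require, character.only = TRUE))

# Simulate 30 agents with random codes and policy counts
set.seed(2)
n <- 30
db <- tibble(gent_code = paste0(stri_rand_strings(n, 1, '[A-Z]'),
                                stri_rand_strings(n, 4, '[0-9]')),
             nr_pol = sample(1L:100L, n, TRUE))

# percent_rank() rescales the minimum ranks to [0, 1]
db %>%
  mutate(percentile = percent_rank(nr_pol)) %>%
  print(n = n)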
which gives the output:
gent_code nr_pol percentile
<chr> <int> <dbl>
1 E0188 35 0.241
2 S5682 91 0.862
3 O6192 96 0.931
4 E1197 97 1.000
5 Y9358 39 0.345
6 Y0069 63 0.552
7 D2879 14 0.138
8 V6778 25 0.172
9 M6284 75 0.759
10 O3420 69 0.690
11 O2301 35 0.241
12 G1728 3 0.0345
13 T4536 38 0.310
14 E0418 1 0
15 K9373 44 0.414
16 W9335 66 0.621
17 Z4140 58 0.448
18 F1424 62 0.517
19 L9825 96 0.931
20 B8411 59 0.483
21 R0735 41 0.379
22 K8881 81 0.793
23 V9502 87 0.828
24 D9827 5 0.0690
25 J5363 8 0.103
26 M2909 68 0.655
27 D3658 94 0.897
28 J1312 34 0.207
29 Z6347 63 0.552
30 D6342 72 0.724
As you can see, it starts at 0 as you want, but the highest percentile will be equal to 1, because it corresponds to the highest number of policies in your data.
EDIT: Forcing 12 in this case to be equal to, e.g., the 99th percentile implies that you have data points higher than 12 in the data. It will be equal to 1 because all of your data points are less than or equal to this value.
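If the goal is the share of values strictly below each value, one possible sketch (not part of the answer above) is to divide the minimum rank minus one by n instead of n - 1. The helper name strict_percentile is hypothetical, and the sketch assumes the vector contains no NA values. The minimum then maps to 0, and the maximum maps to the fraction of values below it, which is always less than 1.

```r
# number_policies from the question
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)

# Hypothetical helper: proportion of values strictly less than each element.
# rank(..., ties.method = "min") - 1 counts how many values are smaller.
strict_percentile <- function(v) {
  (rank(v, ties.method = "min") - 1) / length(v)
}

p <- strict_percentile(x)
round(p, 4)

# Cross-check against the direct definition mean(x < xi)
all.equal(p, sapply(x, function(xi) mean(x < xi)))
```

With only nine observations the maximum gets 8/9 rather than 0.99; it approaches 1 only as the number of observations grows.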
You can simply do this with the quantile function:
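# quantile() needs a numeric vector, so pass the column rather than the whole data frame
quantile(df$number_policies, probs = c(0, 0.24, 0.49, 0.74, 0.99))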
Hope that helps!
I think this is what you want, but I'm not sure; you just have to set up the labels and probs the way you would like them.
iris2 <- iris
# Bin Sepal.Length by its quartiles and label each bin with its upper probability
iris2$quartile_number <- cut(iris$Sepal.Length,
                             quantile(iris$Sepal.Length),
                             include.lowest = TRUE,
                             labels = c(.25, .5, .75, 1))
head(iris2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species quartile_number
1 5.1 3.5 1.4 0.2 setosa 0.25
2 4.9 3.0 1.4 0.2 setosa 0.25
3 4.7 3.2 1.3 0.2 setosa 0.25
4 4.6 3.1 1.5 0.2 setosa 0.25
5 5.0 3.6 1.4 0.2 setosa 0.25
6 5.4 3.9 1.7 0.4 setosa 0.5
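As a rough illustration of changing probs and labels together, the sketch below bins the same column into quintiles instead of quartiles. The choice of quintiles and the variable names probs, breaks and quintile are assumptions made for this example only.

```r
# Hypothetical choice of probabilities: quintiles instead of quartiles
probs <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
breaks <- quantile(iris$Sepal.Length, probs = probs)

# Label each bin with the probability at its upper edge
quintile <- cut(iris$Sepal.Length,
                breaks = breaks,
                include.lowest = TRUE,
                labels = c(0.2, 0.4, 0.6, 0.8, 1))
table(quintile)
```

Note that cut() with its default right = TRUE assigns a value to the bin whose upper break it is less than or equal to, so this gives coarse groups rather than the strict-below percentile asked for in the question. It also fails if the quantile breaks are not unique, which can happen with heavily tied data.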
x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)
(1) For the minimum value 2 to be the 0th percentile, remove the minimum value from the vector. (2) For the maximum value 12 to be the 99th percentile, add one value larger than the maximum and pad the vector with the maximum value until its length is 100.
x1 <- c(x[x > min(x)], Inf)
x2 <- c(x1, rep(max(x), 100 - length(x1)))
ecdf(x2)(x)
> ecdf(x2)(x)
[1] 0.03 0.99 0.00 0.00 0.05 0.03 0.04 0.00 0.03
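The two steps above can be wrapped in a small function for reuse. This is only a sketch: the name padded_ecdf and the argument target_len are hypothetical, and it assumes the vector has fewer than target_len elements above its minimum.

```r
# Hypothetical wrapper around steps (1) and (2)
padded_ecdf <- function(v, target_len = 100) {
  # Step (1): drop the minimum values; also append one value above the maximum
  kept <- c(v[v > min(v)], Inf)
  stopifnot(length(kept) <= target_len)
  # Step (2): pad with the maximum until the vector has target_len elements
  padded <- c(kept, rep(max(v), target_len - length(kept)))
  ecdf(padded)(v)
}

x <- c(3, 12, 2, 2, 7, 3, 5, 2, 3)
res <- padded_ecdf(x)
res
```

Because 93 of the 100 padded entries equal the maximum, the intermediate values end up close to 0 (for example, 3 maps to 0.03), so this approach pins the two endpoints rather than describing the spread of the original data.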