Handling NAs when mapping several dataframe columns into their percentile values in R
> dput(zed)
structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742,
0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209,
0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946,
0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367,
0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195,
0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567,
0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022,
0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087,
0.141872719902716, 0, 0.176720075400566, 0.253924284395199),
col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226,
0.564015792442188, NaN, 0.116857208286359, 0.136034761917893,
0.137370134394451, 0.227357158778513, 0.215714919326088,
0.240671647524362, 0.107512520868114, 0.0681162324911809,
0.195274360476469, NaN, 0.208033156719459, 0.199848016844409,
0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
> zed
# A tibble: 20 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 0 0.122 0.228
2 0.236 0.180 0.265
3 0.438 0.306 0.190
4 0 0 0.564
5 NaN NaN NaN
6 0.199 0.239 0.117
7 0.0755 0.0626 0.136
8 0.102 0.162 0.137
9 0.0459 0.0613 0.227
10 0.256 0.220 0.216
11 0.0881 0.136 0.241
12 0.117 NaN 0.108
13 0.128 0 0.0681
14 0 0 0.195
15 0.118 0.0934 NaN
16 0.115 0.140 0.208
17 0.0584 0.142 0.200
18 0.123 0 0.140
19 0.197 0.177 0.202
20 0.0370 0.254 0.0927
I have the following dataframe, which has multiple columns (col1, col2, col3) that I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference, and what I assume is easiest, is to add 3 additional columns col1pctile, col2pctile, col3pctile that map each respective column to its percentile value (within that column).
Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.
> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) :
missing values and NaN's not allowed if 'na.rm' is FALSE
Although the example dataframe above only has 20 rows, my actual dataframe has many more rows, and percentile values make sense for my use case (whereas percentiles wouldn't make sense for only 20 rows).
I will edit this post shortly with my current attempts, which aren't working as I'd hoped. Any help with this would be greatly appreciated!
There are two challenges when using the percentile function from the fmsb package. First, it cannot handle missing values. Second, it cannot handle zero.
Here is the code of the percentile function.
library(dplyr)
library(fmsb)
percentile
# function (dat)
# {
# pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
# pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
# pt3 <- rownames(pt2)
# pt4 <- as.integer(strsplit(pt3, "%"))
# datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
# return(datp)
# }
# <bytecode: 0x0000000016c498b0>
# <environment: namespace:fmsb>
As you can see, there is no way to pass the na.rm argument through to the quantile function. However, simply setting na.rm = TRUE in the quantile call is not enough, because we would also like the function to return NA where the input is NA.
In addition, when given a vector that contains zero, the function returns an error, as follows.
percentile(0:5)
# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
# 'breaks' are not unique
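The "'breaks' are not unique" error can be traced to the cut() call inside percentile(): the break vector is built as c(0, pt2$pt1), so when the smallest value in the data is itself 0, the value 0 appears twice. A minimal base R sketch of that step, reusing only the quantile call shown in the function source above (the variable names x and breaks are illustrative, not part of fmsb):

```r
x <- 0:5

# Same quantile call as in the percentile() source shown above
pt1 <- quantile(x, probs = seq(0, 1, by = 0.01), type = 7)

# percentile() prepends 0 to the unique quantile values before calling cut()
breaks <- c(0, unique(unname(pt1)))

# The minimum of x is 0, so the first two breaks are both 0
head(breaks, 3)

# TRUE means cut() would reject these breaks
anyDuplicated(breaks) > 0
```

This is why the rewrite below nudges zeros upward by a small amount before computing the quantiles.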
My suggestion is to rewrite the function so that it returns NA for NA input values and adds a small number to zero. Here is my modified version of the function, which I called percentile_narm_zero.
percentile_narm_zero <- function(dat, small = 0.0000000000001){
  # Create a data frame with the numeric values and index
  dat2 <- data.frame(index = seq_along(dat), dat = dat)
  # Remove NA (is.na() is also TRUE for NaN)
  dat3 <- dat2[!is.na(dat2$dat), ]
  # Add a small number to 0
  dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)
  # This part is the same as the percentile function
  pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
  pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
  pt3 <- rownames(pt2)
  pt4 <- as.integer(strsplit(pt3, "%"))
  datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
  # Merge datp back to dat2 so that rows with NA input get NA output
  dat3$datp <- datp
  dat4 <- merge(dat2, dat3, by = "index", all = TRUE)
  return(dat4$datp)
}
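The merge-by-index step in the function above is one way to put the NA values back in their original positions. A lighter sketch of the same idea uses a logical mask instead of a merge. The rank() call here is only a stand-in for the percentile computation, and the names x, ok, and out are illustrative:

```r
x <- c(0.2, NaN, 0.1, NA, 0.4)

# TRUE where the input is usable; is.na() is also TRUE for NaN
ok <- !is.na(x)

# Start from an all-NA result of the same length as the input
out <- rep(NA_real_, length(x))

# Compute on the non-missing values only and write them back in place
out[ok] <- rank(x[ok])

out
```

Positions 2 and 4 stay NA, and the remaining positions receive the values computed from the non-missing subset.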
Now we can apply this function to all columns in zed using mutate_all.
zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
# A tibble: 20 x 6
# col1 col2 col3 col1_pctile col2_pctile col3_pctile
# <dbl> <dbl> <dbl> <int> <int> <int>
# 1 0 0.122 0.228 11 42 83
# 2 0.236 0.180 0.265 89 77 95
# 3 0.438 0.306 0.190 100 100 42
# 4 0 0 0.564 11 17 100
# 5 NaN NaN NaN NA NA NA
# 6 0.199 0.239 0.117 84 89 18
# 7 0.0755 0.0626 0.136 34 30 24
# 8 0.102 0.162 0.137 45 65 30
# 9 0.0459 0.0613 0.227 23 24 77
# 10 0.256 0.220 0.216 95 83 71
# 11 0.0881 0.136 0.241 39 48 89
# 12 0.117 NaN 0.108 56 NA 12
# 13 0.128 0 0.0681 73 17 0
# 14 0 0 0.195 11 17 48
# 15 0.118 0.0934 NaN 62 36 NA
# 16 0.115 0.140 0.208 50 53 65
# 17 0.0584 0.142 0.200 28 59 53
# 18 0.123 0 0.140 67 17 36
# 19 0.197 0.177 0.202 78 71 59
# 20 0.0370 0.254 0.0927 17 95 6
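Note that funs() has been deprecated since dplyr 0.8.0 and mutate_all() is superseded by across() as of dplyr 1.0.0, so the call above may emit deprecation warnings in recent dplyr versions. A minimal sketch of the across() form follows, assuming dplyr 1.0.0 or later. To keep it self-contained it uses a small toy data frame and a hypothetical stand-in function to_rank; percentile_narm_zero could be passed in its place.

```r
library(dplyr)

# Toy data with the same kind of missing values as zed
toy <- tibble(col1 = c(0, 0.5, NaN), col2 = c(1, NA, 3))

# Hypothetical stand-in for percentile_narm_zero; keeps NA where input is NA
to_rank <- function(x) rank(x, na.last = "keep")

# Apply the function to every column and append <name>_pctile columns
out <- toy %>%
  mutate(across(everything(), to_rank, .names = "{.col}_pctile"))

names(out)
```

The .names argument reproduces the col1_pctile, col2_pctile naming that funs(pctile = ...) generates in the original call.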
First define a function to calculate the percentile group:
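library(dplyr)  # for %>%, mutate_if() and funs()
library(purrr)  # for discard()
percentile_group <- function(x)
{
  # Drop missing values before computing the break points
  y <- as.numeric(x) %>% discard(is.na)
  # by = 0.1 yields decile groups; use by = 0.01 for percentile groups
  qn <- quantile(y, probs = seq(0, 1, by = 0.1), na.rm = TRUE) %>% unique()
  # cut() returns NA for NA input, so missing values stay missing
  grp <- cut(x, breaks = qn, include.lowest = TRUE, labels = FALSE)
  return(grp)
}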
Now use the function in a mutate statement:
mutate_if(zed, is.numeric, funs(pctile = percentile_group))
The output is:
# A tibble: 20 x 6
col1 col2 col3 col1_pctile col2_pctile col3_pctile
<dbl> <dbl> <dbl> <int> <int> <int>
1 0 0.122 0.228 1 4 9
2 0.236 0.180 0.265 8 7 10
3 0.438 0.306 0.190 9 9 5
4 0 0 0.564 1 1 10
5 NaN NaN NaN NA NA NA
6 0.199 0.239 0.117 8 8 2
7 0.0755 0.0626 0.136 3 2 3
8 0.102 0.162 0.137 4 6 3
9 0.0459 0.0613 0.227 2 2 8
10 0.256 0.220 0.216 9 8 8
11 0.0881 0.136 0.241 3 4 9
12 0.117 NaN 0.108 5 NA 2
13 0.128 0 0.0681 7 1 1
14 0 0 0.195 1 1 5
15 0.118 0.0934 NaN 6 3 NA
16 0.115 0.140 0.208 4 5 7
17 0.0584 0.142 0.200 2 5 6
18 0.123 0 0.140 6 1 4
19 0.197 0.177 0.202 7 7 6
20 0.0370 0.254 0.0927 1 9 1
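The groups in the output above run from 1 to 10 because the function uses probs = seq(0, 1, by = 0.1), which gives deciles. A hedged base R sketch of a percentile variant follows, changing only the step to 0.01. The name percentile_group100 is hypothetical. The returned numbers are indices into the unique break points, so with heavy ties or very few rows they will not span the full 1:100 range.

```r
percentile_group100 <- function(x) {
  x <- as.numeric(x)
  # 101 break points; duplicates from tied values are collapsed by unique()
  qn <- unique(quantile(x, probs = seq(0, 1, by = 0.01), na.rm = TRUE))
  # NA and NaN inputs fall in no interval, so cut() returns NA for them
  cut(x, breaks = qn, include.lowest = TRUE, labels = FALSE)
}

x <- c(0, 0.236, 0.438, 0, NaN, 0.199)
res <- percentile_group100(x)
res
```

On a short vector like this one the result is mostly useful for checking that missing values are preserved and that tied values share a group.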