Handling NAs when mapping several dataframe columns into their percentile values in R

> dput(zed)
    structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742, 
    0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209, 
    0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946, 
    0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367, 
    0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195, 
    0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567, 
    0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022, 
    0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087, 
    0.141872719902716, 0, 0.176720075400566, 0.253924284395199), 
        col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226, 
        0.564015792442188, NaN, 0.116857208286359, 0.136034761917893, 
        0.137370134394451, 0.227357158778513, 0.215714919326088, 
        0.240671647524362, 0.107512520868114, 0.0681162324911809, 
        0.195274360476469, NaN, 0.208033156719459, 0.199848016844409, 
        0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA, 
    -20L), class = c("tbl_df", "tbl", "data.frame"))

> zed
# A tibble: 20 x 3
       col1     col2     col3
      <dbl>    <dbl>    <dbl>
 1   0        0.122    0.228 
 2   0.236    0.180    0.265 
 3   0.438    0.306    0.190 
 4   0        0        0.564 
 5 NaN      NaN      NaN     
 6   0.199    0.239    0.117 
 7   0.0755   0.0626   0.136 
 8   0.102    0.162    0.137 
 9   0.0459   0.0613   0.227 
10   0.256    0.220    0.216 
11   0.0881   0.136    0.241 
12   0.117  NaN        0.108 
13   0.128    0        0.0681
14   0        0        0.195 
15   0.118    0.0934 NaN     
16   0.115    0.140    0.208 
17   0.0584   0.142    0.200 
18   0.123    0        0.140 
19   0.197    0.177    0.202 
20   0.0370   0.254    0.0927

I have the following dataframe with multiple columns (col1, col2, col3) that I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference, and what I assume is easiest, is to add 3 additional columns col1pctile, col2pctile, col3pctile that map each respective column to its percentile value (within that column).

Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.

> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE

Although the example dataframe above has only 20 rows, my actual dataframe has many more, and percentile values make sense for my use case (whereas they wouldn't for only 20 rows).

I will edit this post shortly with my current attempts, which aren't working as I'd hoped. Any help with this would be greatly appreciated!

There are two challenges when using the percentile function from the fmsb package. First, it cannot handle missing values. Second, it cannot handle a vector whose minimum is zero, because the function prepends 0 to the quantile breaks it passes to cut, which duplicates the 0% quantile.

Here is the source code of the percentile function.

library(dplyr)
library(fmsb)

percentile
# function (dat) 
# {
#   pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
#   pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
#   pt3 <- rownames(pt2)
#   pt4 <- as.integer(strsplit(pt3, "%"))
#   datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
#   return(datp)
# }
# <bytecode: 0x0000000016c498b0>
#   <environment: namespace:fmsb>

As you can see, there is no way to pass the na.rm argument through to the quantile function. However, setting na.rm = TRUE inside quantile only addresses the quantile step; we also want the function to return NA wherever the input is NA.
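As a small illustration of the point above, the sketch below shows that quantile rejects missing values by default, accepts them with na.rm = TRUE, and that the NA positions can be recorded separately so they can be restored later. The vector x is made-up example data, not taken from zed.

```r
# Made-up example vector with one NaN (assumption, not from the question)
x <- c(0.1, NaN, 0.3, 0.2)

# The default call stops with an error when NA/NaN values are present
default_call <- tryCatch(quantile(x), error = function(e) "error")

# With na.rm = TRUE the quantiles are computed from the non-missing values
q <- quantile(x, na.rm = TRUE)

# Remember where the missing values were so the result can keep NA there
na_pos <- is.na(x)
```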

In addition, when given a vector that contains zero, the function returns the following error.

percentile(0:5)
# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) : 
#  'breaks' are not unique
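To see where this error comes from, the following sketch repeats the two relevant steps of the percentile source shown earlier on the same input. The 0% quantile equals the minimum of the data, so when the minimum is 0 the prepended 0 appears twice in the breaks. The variable names here are only for illustration.

```r
x <- 0:5

# Same quantile call as inside fmsb::percentile
q <- quantile(x, probs = seq(0, 1, by = 0.01), type = 7)

# percentile() prepends 0 to the quantiles before calling cut()
breaks <- c(0, q)

# Position of the first duplicated break (0 would mean no duplicates)
dup_index <- anyDuplicated(breaks)

# cut() refuses duplicated breaks; capture the message instead of stopping
res <- tryCatch(cut(x, breaks), error = function(e) conditionMessage(e))
```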

My suggestion is to rewrite the function so that it returns NA for NA input values and adds a small number to zeros. Here is my modified version of the function, which I called percentile_narm_zero.

percentile_narm_zero <- function(dat, small = 0.0000000000001){

  # Create a data frame with the numeric values and index
  dat2 <- data.frame(index = seq_along(dat), dat = dat)
  # Remove NA
  dat3 <- dat2[ !is.na(dat2$dat), ]
  # Add a small number to 0
  dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)

  # This part is the same as the percentile function
  pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
  pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
  pt3 <- rownames(pt2)
  pt4 <- as.integer(strsplit(pt3, "%"))
  datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1), labels = 1:length(pt3)))]

  # Merge datp back to dat2
  dat3$datp <- datp
  dat4 <- merge(dat2, dat3, by = "index", all = TRUE)

  return(dat4$datp)
}
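The same idea can be written more compactly without the index column and merge step, by filling a pre-allocated NA vector through a logical index. This is only a sketch under the assumption that all non-missing values are non-negative (as in zed); the name percentile_keep_na is hypothetical and not part of fmsb.

```r
# Hypothetical compact variant; assumes non-negative, non-missing values
percentile_keep_na <- function(dat, small = 1e-13) {
  out <- rep(NA_integer_, length(dat))   # NA wherever the input is NA
  ok <- !is.na(dat)
  vals <- dat[ok]
  vals[vals == 0] <- small               # nudge zeros so breaks stay unique

  q <- quantile(vals, probs = seq(0, 1, by = 0.01), type = 7)
  q <- q[!duplicated(q, fromLast = TRUE)]      # keep the last of tied quantiles
  labs <- as.integer(sub("%", "", names(q)))   # "11%" -> 11

  # cut() returns the interval number, which indexes into the labels
  out[ok] <- labs[cut(vals, breaks = c(0, q), labels = FALSE)]
  out
}

res <- percentile_keep_na(c(0, 0.5, NA, 1))
```

Writing into out[ok] keeps the result aligned with the input, so no join on an index is needed.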

Now we can apply this function to all columns in zed using mutate_all.

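# funs() is deprecated since dplyr 0.8.0; a named list of lambdas gives the same column names
zed2 <- zed %>% mutate_all(list(pctile = ~ percentile_narm_zero(.)))
zed2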
# A tibble: 20 x 6
#       col1     col2     col3 col1_pctile col2_pctile col3_pctile
#      <dbl>    <dbl>    <dbl>       <int>       <int>       <int>
#  1   0        0.122    0.228           11          42          83
#  2   0.236    0.180    0.265           89          77          95
#  3   0.438    0.306    0.190          100         100          42
#  4   0        0        0.564           11          17         100
#  5 NaN      NaN      NaN               NA          NA          NA
#  6   0.199    0.239    0.117           84          89          18
#  7   0.0755   0.0626   0.136           34          30          24
#  8   0.102    0.162    0.137           45          65          30
#  9   0.0459   0.0613   0.227           23          24          77
# 10   0.256    0.220    0.216           95          83          71
# 11   0.0881   0.136    0.241           39          48          89
# 12   0.117  NaN        0.108           56          NA          12
# 13   0.128    0        0.0681          73          17           0
# 14   0        0        0.195           11          17          48
# 15   0.118    0.0934 NaN               62          36          NA
# 16   0.115    0.140    0.208           50          53          65
# 17   0.0584   0.142    0.200           28          59          53
# 18   0.123    0        0.140           67          17          36
# 19   0.197    0.177    0.202           78          71          59
# 20   0.0370   0.254    0.0927          17          95           6
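In dplyr 1.0.0 and later, mutate_all is superseded by across. The sketch below shows the equivalent call shape on a made-up tibble. Both toy and pct_stub are stand-ins introduced only for this illustration (pct_stub is not the function defined above); in practice percentile_narm_zero would be passed in its place.

```r
library(dplyr)

# Stand-in for percentile_narm_zero(); any vector-in, vector-out function fits
pct_stub <- function(x) as.integer(round(100 * percent_rank(x)))

# Made-up data with missing values in both columns
toy <- tibble(col1 = c(0, 0.2, NaN, 0.4),
              col2 = c(0.1, NaN, 0.3, 0))

# .names controls the suffix of the new columns
toy2 <- toy %>%
  mutate(across(everything(), pct_stub, .names = "{.col}_pctile"))
```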

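First, define a function that computes the quantile group of each value. Note that with probs = seq(0, 1, by = 0.1) the groups are deciles (1 to 10); use by = 0.01 for percentiles: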

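library(dplyr)  # provides %>%
library(purrr)  # provides discard()

percentile_group <- function(x)
{
  # Drop missing values before computing the break points
  y <- as.numeric(x) %>% discard(is.na)
  # by = 0.1 yields decile breaks; unique() drops tied break points
  qn <- quantile(y, probs = seq(0, 1, by = 0.1), na.rm = TRUE) %>% unique()
  # NA inputs fall outside every interval, so cut() returns NA for them
  grp <- cut(x, breaks = qn, include.lowest = TRUE, labels = FALSE)
  return(grp)
}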
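Since the question asks for percentiles while the function above uses decile breaks, here is a base-R sketch of the same approach with the number of groups as a parameter. The name quantile_group and the argument n_groups are hypothetical. It assumes the input has at least two distinct non-missing values, otherwise cut receives a single break and does not behave as intended.

```r
# Hypothetical generalisation: n_groups = 10 gives deciles, 100 gives percentiles
quantile_group <- function(x, n_groups = 100) {
  x <- as.numeric(x)
  probs <- seq(0, 1, length.out = n_groups + 1)
  # Tied break points are dropped, so fewer than n_groups groups may result
  qn <- unique(quantile(x, probs = probs, na.rm = TRUE))
  # NA inputs stay NA; include.lowest keeps the minimum in the first group
  cut(x, breaks = qn, include.lowest = TRUE, labels = FALSE)
}

g <- quantile_group(c(1:10, NA), n_groups = 10)
```

As with the original function, heavily tied data (for example many zeros) collapses several breaks into one, so the highest group number can be lower than n_groups.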

Now use the function in a mutate statement:

 mutate_if(zed, is.numeric, list(pctile = percentile_group))

The output is:

# A tibble: 20 x 6
       col1     col2     col3 col1_pctile col2_pctile col3_pctile
      <dbl>    <dbl>    <dbl>       <int>       <int>       <int>
 1   0        0.122    0.228            1           4           9
 2   0.236    0.180    0.265            8           7          10
 3   0.438    0.306    0.190            9           9           5
 4   0        0        0.564            1           1          10
 5 NaN      NaN      NaN               NA          NA          NA
 6   0.199    0.239    0.117            8           8           2
 7   0.0755   0.0626   0.136            3           2           3
 8   0.102    0.162    0.137            4           6           3
 9   0.0459   0.0613   0.227            2           2           8
10   0.256    0.220    0.216            9           8           8
11   0.0881   0.136    0.241            3           4           9
12   0.117  NaN        0.108            5          NA           2
13   0.128    0        0.0681           7           1           1
14   0        0        0.195            1           1           5
15   0.118    0.0934 NaN                6           3          NA
16   0.115    0.140    0.208            4           5           7
17   0.0584   0.142    0.200            2           5           6
18   0.123    0        0.140            6           1           4
19   0.197    0.177    0.202            7           7           6
20   0.0370   0.254    0.0927           1           9           1
