简体   繁体   English

根据R中的条件创建新的列变量

[英]Create new column variables based upon condition in R

I don't know why this is not working. 我不知道为什么这不起作用。 I have tried it various ways and it just does not work. 我已经尝试了各种方法,但是它不起作用。 It's not that I get an error using the if-statements I made myself, but it doesn't apply right. 这并不是说我使用自己编写的if语句会出错,但这并不适用。

Basically, there is a column Data$Age and a column Data$Age2 . 基本上,有一列Data$Age和一列Data$Age2

If Data$Age is value 50 - 100, I want Data$Age2 to say "50-100 Years" for that particular row. 如果Data$Age的值是50-100,我希望Data$Age2为该特定行说“ 50-100年”。

Likewise, if Data$Age is 25-50, I want Data$Age2 to say "25-50 Years" for the rows to which it applies. 同样,如果Data$Age为25-50,我希望Data$Age2对其适用的行说“ 25-50年”。

What would the cleanest way to go about doing this in R? 在R中执行此操作的最干净方法是什么?

dplyr may have the cleanest solution to this dplyr可能对此有最干净的解决方案

Using Len Greski's sample data below... 使用下面的Len Greski的样本数据...

data <- data.frame(Age1 = round(runif(100)*100,0))

data%>%
mutate(Age2 = ifelse(between(Age1, 25, 50), "25 - 50 Years", 
              ifelse(between(Age1, 51, 100),"51 - 100 Years", "Less than 25 years old")))

Assuming you only want two values for the column. 假设您只需要该列的两个值。 ifelse() is not efficient for more than two matches, say 100, though. ifelse()对于两个以上的匹配(例如100 ifelse()效率不高。 I'll have to think of an alternative approach in the event that its not. 如果没有,我将不得不考虑一种替代方法。

EDIT: or as Len has suggested below this, in a comment. 编辑:或如Len在下面的注释中所建议。

data%>% 
mutate(Age2 = cut(Age1,c(24,50,100),c("25-50 years","51-100 Years")))

So far, all the posted answers by Len Greski and InfiniteFlashChess in first place have suggested to use repeated subsetting statements or repeated calls to ifelse() for each age range. 到目前为止, Len GreskiInfiniteFlashChess首先发布的所有答案都建议对每个年龄段使用重复的子集语句或对ifelse()重复调用。

IMHO, this can't be considered as clean because it doesn't scale well with the number of age ranges. 恕我直言,这不能被认为是干净的,因为它不能很好地适应年龄范围的数量。 The only data-driven solution suggested by Onyambu in his comment is to use the cut() function from base R. Onyambu在其评论中建议的唯一数据驱动解决方案是使用基数R中的cut()函数。

Here, I suggest another data-driven solution which uses a lookup table with the lower and upper bounds of the age ranges and the associated labels and which updates in a non-equi join . 在这里,我建议另一个数据驱动的解决方案,该解决方案使用具有年龄范围的上下限和相关标签的查找表 ,并在非等额联接中进行更新 This will allows us to specify an arbitrary number of ranges without any changes to the code: 这将允许我们指定任意数量的范围,而无需更改代码:

library(data.table)
# define lookup table
lookup <- data.table(
  lower = c(25L, 51L),
  upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]
  lower upper label 1: 25 50 25-50 Years 2: 51 100 51-100 Years 
# create sample data set
Data <- data.frame(Age = c(24:26, 49:52, 100:102))

# update in non-equi join
setDT(Data)[lookup, on =.(Age >= lower, Age <= upper), Age2 := label][]
  Age Age2 1: 24 NA 2: 25 25-50 Years 3: 26 25-50 Years 4: 49 25-50 Years 5: 50 25-50 Years 6: 51 51-100 Years 7: 52 51-100 Years 8: 100 51-100 Years 9: 101 NA 10: 102 NA 

Note that NA indicate gaps in the age ranges defined in the lookup table. 请注意, NA表示查找表中定义的年龄范围内的差距。

Benchmarking 标杆

InfiniteFlashChess has asked about benchmark results. InfiniteFlashChess询问了基准测试结果。

Any benchmarks will depend on the number of rows in Data as well as on the number of groups, ie, age ranges. 任何基准都将取决于Data的行数以及组数(即年龄范围)。 So, we will do benchmark runs for 100 and 1 M rows as well as for 2 groups (as specified by the OP) and 8 groups. 因此,我们将针对100和1 M行以及2组(由OP指定)和8组进行基准测试。

The benchmark code for 2 groups: 2组的基准代码:

library(data.table)
library(dplyr)
n_row <- 1E2L
set.seed(123L)
Data0 <- data.frame(Age = sample.int(105L, n_row, TRUE))

lookup <- data.table(
  lower = c(25L, 51L),
  upper = c(50L, 100L)
)
lookup[, label := sprintf("%i-%i Years", lower, upper)][]

microbenchmark::microbenchmark(
  ifelse = {
    copy(Data0) %>%
      mutate(Age2 = ifelse(between(Age, 25, 50), "25 - 50 Years", 
                           ifelse(between(Age, 51, 100), "51 - 100 Years", 
                                  "")))
  },
  cut = {
    copy(Data0) %>% 
      mutate(Age2 = cut(Age, c(24,50,100), c("25-50 years","51-100 Years")))
  },
  baseR = {
    data <- copy(Data0)
    data$age2 <- ""
    data$age2[data$Age %in% 51:100] <- "51 - 100 years"
    data$age2[data$Age %in% 25:50] <- "25 - 50 years"
  },
  join_dt = {
    Data <- copy(Data0)
    setDT(Data)[lookup, on =.(Age >= lower, Age <= upper), Age2 := label]
  },
  times = 100L
)

Benchmark results for 100 rows: 100行的基准结果:

 Unit: microseconds expr min lq mean median uq max neval cld ifelse 2280.588 2415.006 2994.83792 2501.8495 2827.513 20545.672 100 c cut 2272.280 2407.455 2716.67432 2537.3425 2827.135 7351.495 100 c baseR 57.016 83.446 94.80729 91.1865 106.667 164.248 100 a join_dt 1165.970 1318.889 1560.19394 1485.4025 1691.939 2803.159 100 b 

Benchmark results for 1 M rows: 100万行的基准结果:

 Unit: milliseconds expr min lq mean median uq max neval cld ifelse 618.08286 626.72757 672.28875 639.04973 758.83435 773.25566 10 c cut 197.16467 200.53571 219.58635 203.77460 214.24227 343.56061 10 b baseR 52.96059 59.36964 76.09431 62.19055 66.32506 198.73654 10 a join_dt 66.89256 67.61147 73.33428 72.55457 78.18675 81.69146 10 a 

Benchmarking for 8 groups requires to write nested ifelse() or repeated subset operations: 对8个组进行基准测试需要编写嵌套的ifelse()或重复的子集操作:

breaks <- seq(20, 100, 10)

lookup <- data.table(
  lower = head(breaks, -1L),
  upper = tail(breaks, -1L)
)
lookup[, label := sprintf("%i-%i Years", lower + 1L, upper)][]


microbenchmark::microbenchmark(
  ifelse = {
    copy(Data0) %>%
      mutate(
        Age2 = ifelse(
          between(Age, 21, 30), "21 - 20 Years", ifelse(
            between(Age, 31, 40), "31 - 40 Years", ifelse(
              between(Age, 41, 50), "41 - 50 Years", ifelse(
                between(Age, 51, 60), "51 - 60 Years", ifelse(
                  between(Age, 61, 70), "61 - 70 Years", ifelse(
                    between(Age, 71, 80), "71 - 80 Years", ifelse(
                      between(Age, 81, 90), "81 - 90 Years", ifelse(
                        between(Age, 91, 100), "91 - 100 Years", "")))))))))
  },
  cut = {
    copy(Data0) %>% 
      mutate(Age2 = cut(Age, breaks))
  },
  subset = {
    data <- copy(Data0)
    data$age2 <- ""
    data$age2[data$Age %in% 21:30] <- "21 - 30 years"
    data$age2[data$Age %in% 31:40] <- "31 - 40 years"
    data$age2[data$Age %in% 41:50] <- "41 - 50 years"
    data$age2[data$Age %in% 51:60] <- "51 - 60 years"
    data$age2[data$Age %in% 61:70] <- "61 - 70 years"
    data$age2[data$Age %in% 71:80] <- "71 - 80 years"
    data$age2[data$Age %in% 81:90] <- "81 - 90 years"
    data$age2[data$Age %in% 91:100] <- "91 - 100 years"
  },
  join_dt = {
    Data <- copy(Data0)
    setDT(Data)[lookup, on =.(Age > lower, Age <= upper), Age2 := label]
  },
  times = 100L
)

Benchmark results for 100 rows: 100行的基准结果:

 Unit: microseconds expr min lq mean median uq max neval cld ifelse 2522.617 2663.832 2955.2448 2740.1030 2886.4155 7717.748 100 d cut 2340.622 2470.699 2664.9381 2538.6635 2646.6520 7608.627 100 c subset 174.820 199.741 219.6505 210.5015 231.4575 402.501 100 a join_dt 1198.819 1290.949 1406.2354 1399.1255 1488.4240 1810.500 100 b 

Benchmark results for 1 M rows: 100万行的基准结果:

 Unit: milliseconds expr min lq mean median uq max neval cld ifelse 2427.0599 2429.42131 2539.88611 2457.06191 2565.14682 2992.68891 10 c cut 220.3553 221.53939 243.49476 222.82165 230.06289 406.57277 10 b subset 176.0096 177.92958 199.13398 184.26878 192.60274 323.90338 10 b join_dt 62.7471 64.26875 67.94099 65.07508 75.03169 75.38813 10 a 

Here is a solution using base R. Note that since age2 cannot simultaneously be 25 - 50 and 50 - 100 I made the categories mutually exclusive: 这是使用基数R的解决方案。请注意,由于age2不能同时为25 - 5050 - 100我将类别互斥:

data <- data.frame(age = round(runif(100)*100,0),
                   age2 = rep(" ",100),stringsAsFactors=FALSE)
data$age2[data$age %in% 51:100] <- "51 - 100 years"
data$age2[data$age %in% 25:50] <- "25 - 50 years"
data[1:15,]

...and the output: ...以及输出:

> data[1:15,]
   age           age2
1    0               
2   45  25 - 50 years
3   58 51 - 100 years
4   59 51 - 100 years
5   84 51 - 100 years
6   79 51 - 100 years
7    5               
8   78 51 - 100 years
9   46  25 - 50 years
10   6               
11  73 51 - 100 years
12  37  25 - 50 years
13   5               
14  41  25 - 50 years
15  58 51 - 100 years
> 

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM