
Creating a new column based on two old columns in a data frame

data <- data.frame(foo = c(0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1),
                   bar = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0))

Hi, I have a data frame with two columns, foo and bar. I want to create a new column, complete, based on the foo and bar data.

  • If foo and bar are both 0, then complete should be 0.
  • If foo is 1 and bar is 0, then complete should be 1.
  • If bar is 1 and foo is 0, then complete should be 2.

For example:

foo   bar complete
0     0   0
1     0   1
0     1   2

Edit:

If foo == 1 and bar == 1, then complete should be NA.

Following suit, using NA when both columns are 1: start with the row sums. If any of them equal 2 (the number of columns), replace that value with NA. Then multiply the result by the max.col() value.

rs <- rowSums(data)
cbind(data, complete = max.col(data) * replace(rs, rs == 2, NA))
#    foo bar complete
# 1    0   1        2
# 2    1   0        1
# 3    0   0        0
# 4    0   0        0
# 5    1   1       NA
# 6    0   0        0
# 7    0   1        2
# 8    0   0        0
# 9    1   0        1
# 10   1   1       NA
# 11   1   0        1

If you don't wish to assign new objects, you can use a local environment or wrap it up into a function:

local({
    rs <- rowSums(data)
    max.col(data) * replace(rs, rs == 2, NA)
})
# [1]  2  1  0  0 NA  0  2  0  1 NA  1
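A minimal sketch of the function variant mentioned above (the name complete_code is my own, hypothetical choice; it generalizes rs == 2 to the number of columns, as the explanation suggests):

# Wrap the row-sum / max.col() logic in a reusable function.
# complete_code is a hypothetical name, not from the original answer.
complete_code <- function(d) {
    rs <- rowSums(d)
    # a row sum equal to the number of columns means all entries are 1 -> NA
    max.col(d) * replace(rs, rs == ncol(d), NA)
}
complete_code(data)
# [1]  2  1  0  0 NA  0  2  0  1 NA  1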

If an algebraic approach is sought, we can try one of the lines below:

with(data, 2L * bar + foo + 0L * NA^(bar & foo))
with(data, 2L * bar + foo + NA^(bar & foo) - 1L)
with(data, (2L * bar + foo) * NA^(bar & foo))

All return

 [1]  2  1  0  0 NA  0  2  0  1 NA  1

Explanation

The expression 2L * bar + foo treats bar and foo as the digits of a binary number. The difficulty is returning NA in the case of foo == 1 & bar == 1. For that, bar and foo are treated as logical values: if both are 1, i.e., TRUE, then NA^(bar & foo) returns NA, otherwise 1.

If one operand of an expression is NA, so is the overall expression. So there are several possibilities to combine NA^(bar & foo) with 2L * bar + foo. I wonder which is the fastest.
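A quick console illustration of both rules (outputs shown as comments):

NA^TRUE             # NA, because NA^1 is NA
NA^FALSE            # 1, because x^0 is defined as 1 for any x, even NA
2L * 1L + 1L + NA   # NA, since NA propagates through arithmetic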

Benchmark

So far, seven different approaches have been posted by d.b, Balter, PoGibas, Rich, Frank, user20650, and me.

The OP has supplied his sample data as type double. As I have seen remarkably different timings for integer and double values on other occasions, the benchmark runs will be repeated for each type to investigate the impact of the data type on the different approaches.

Benchmark data

The benchmark data will consist of 1 million rows:

n_row <- 1e6L
set.seed(1234L)
data_int <- data.frame(foo = sample(0:1, n_row, replace = TRUE),
                       bar = sample(0:1, n_row, replace = TRUE))
with(data_int, table(foo, bar))
   bar
foo      0      1
  0 249978 250330
  1 249892 249800
data_dbl <- data.frame(foo = as.double(data_int$foo),
                       bar = as.double(data_int$bar))

Benchmark code

For benchmarking, the microbenchmark package is used.

# define check function to compare results; all.equal() returns either TRUE
# or a character description of the mismatch, so wrap it in isTRUE()
check <- function(values) {
  all(sapply(values[-1], function(x) isTRUE(all.equal(values[[1]], x))))
}

library(dplyr)
data <- data_dbl
microbenchmark::microbenchmark(
  d.b = {
    vect = c("0 0" = 0, "1 0" = 1, "0 1" = 2)
    unname(vect[match(with(data, paste(foo, bar)), names(vect))])
  },
  Balter = with(data,ifelse(foo == 0 & bar == 0, 0,
                            ifelse(foo == 1 & bar == 0, 1,
                                   ifelse(foo == 0 & bar == 1, 2, NA)))),
  PoGibas = with(data, case_when(foo == 0 & bar == 0 ~ 0,
                                   foo == 1 & bar == 0 ~ 1,
                                   foo == 0 & bar == 1 ~ 2)),
  Rich = local({rs = rowSums(data);  max.col(data) * replace(rs, rs == 2, NA)}),
  Frank = with(data, ifelse(xor(foo, bar), max.col(data), 0*NA^foo)),
  user20650 = with(data, c(0, 1, 2, NA)[c(2*bar + foo + 1)]),
  uwe1i = with(data, 2L * bar + foo + 0L * NA^(bar & foo)),
  uwe1d = with(data, 2  * bar + foo + 0  * NA^(bar & foo)),
  uwe2i = with(data, 2L * bar + foo + NA^(bar & foo) - 1L),
  uwe2d = with(data, 2  * bar + foo + NA^(bar & foo) - 1),
  uwe3i = with(data, (2L * bar + foo) * NA^(bar & foo)),
  uwe3d = with(data, (2  * bar + foo) * NA^(bar & foo)),
  times = 11L,
  check = check)

Note that only the result vector is created, without adding a new column to data. The approach of PoGibas was modified accordingly.

As mentioned above, there might be speed differences between integer and double values. Therefore, I also wanted to test the effect of using integer constants, e.g., 0L, 1L, versus double constants, e.g., 0, 1.

Benchmark results

First, for input data of type double :

Unit: milliseconds
      expr        min         lq       mean     median         uq        max neval cld
       d.b 1687.05063 1700.52197 1707.72896 1706.48511 1715.46814 1730.62160    11   e
    Balter  287.89649  377.42284  412.59764  452.75668  458.21178  472.92971    11   d
   PoGibas  152.90900  154.82164  176.09522  158.23214  165.73524  333.48223    11   c
      Rich   67.43862   68.68331   76.42759   77.10620   82.42179   89.90016    11   b
     Frank  170.78293  174.66258  192.85203  179.69422  184.55237  333.74578    11   c
 user20650   20.11790   20.29744   22.32541   20.81453   21.11509   34.45654    11   a
     uwe1i   24.86296   25.13935   28.38634   25.60604   28.79395   45.53514    11   a
     uwe1d   24.90034   25.05439   28.62943   25.41460   29.47379   41.08459    11   a
     uwe2i   25.21222   25.59754   30.15579   26.29135   33.00361   47.13382    11   a
     uwe2d   24.38305   25.09385   29.46715   25.41951   29.11112   45.05486    11   a
     uwe3i   23.27334   23.95714   27.12474   24.28073   25.86336   44.40467    11   a
     uwe3d   23.23332   23.65073   27.60330   23.96620   29.53911   40.41175    11   a

Now, for input data of type integer :

Unit: milliseconds
      expr       min        lq      mean    median        uq       max neval cld
       d.b 591.71859 596.31904 607.51452 601.24232 617.13886 636.51405    11   e
    Balter 284.08896 297.06170 374.42691 303.14888 465.27859 488.19606    11   d
   PoGibas 151.75851 155.28304 174.31369 159.18364 163.50864 329.00412    11   c
      Rich  67.79770  71.22311  78.38562  77.46642  84.56777  96.55540    11   b
     Frank 166.60802 170.34078 192.19833 180.09257 182.43584 350.86681    11   c
 user20650  19.79204  20.06220  21.95963  20.18624  20.42393  30.13135    11   a
     uwe1i  27.54680  27.83169  32.36917  28.08939  37.82286  45.21722    11  ab
     uwe1d  22.60162  22.89350  25.94329  23.10419  23.74173  47.39435    11   a
     uwe2i  27.05104  27.57607  27.80843  27.68122  28.02048  28.88193    11   a
     uwe2d  22.83384  22.93522  23.22148  23.12231  23.41210  24.18633    11   a
     uwe3i  25.17371  26.44427  29.34889  26.68290  27.08276  47.71379    11   a
     uwe3d  21.68712  21.83060  26.16276  22.37659  28.40750  43.33989    11   a

For both integer and double input values, the approach by user20650 is the fastest. Next are my algebraic approaches. Third is Rich's solution, but it is about three times slower than the second.

The type of input data has the strongest impact on d.b's solution and, to a lesser extent, on Balter's. The other solutions seem to be rather invariant.

Interestingly, there seems to be no remarkable difference between using integer and double constants in my algebraic solutions.
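For reference, user20650's winning lookup approach applied to the OP's original 11-row data reproduces the earlier result:

# index a 4-element lookup vector with the binary code 2*bar + foo
# (+1 because R indexing is 1-based)
with(data, c(0, 1, 2, NA)[2 * bar + foo + 1])
# [1]  2  1  0  0 NA  0  2  0  1 NA  1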

You can create a named vector (vect in this example) and look up values from that vector using match:

vect = c("0 0" = 0, "1 0" = 1, "0 1" = 2)
unname(vect[match(with(data, paste(foo, bar)), names(vect))])
# [1]  2  1  0  0 NA  0  2  0  1 NA  1

There are a lot of ways to do this, some more efficient than others depending on how many conditions you have. But a basic way is:

data$New_Column <- with(data,ifelse(foo == 0 & bar == 0, 0,
                         ifelse(foo == 1 & bar == 0, 1,
                         ifelse(foo == 0 & bar == 1, 2, NA))))

#   foo bar New_Column
#1    0   1          2
#2    1   0          1
#3    0   0          0
#4    0   0          0
#5    1   1         NA
#6    0   0          0
#7    0   1          2
#8    0   0          0
#9    1   0          1
#10   1   1         NA
#11   1   0          1
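With many conditions, nested ifelse() calls become hard to read. PoGibas's case_when() variant from the benchmark (requires dplyr) expresses the same mapping; rows matching no condition (foo == 1 and bar == 1) default to NA:

library(dplyr)
# each formula reads condition ~ value; the first match wins
data$New_Column <- with(data, case_when(foo == 0 & bar == 0 ~ 0,
                                        foo == 1 & bar == 0 ~ 1,
                                        foo == 0 & bar == 1 ~ 2))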
