简体   繁体   中英

R - dplyr/purrr - Create new columns from function of pairs of existing columns

I've run into a stumbling block with dplyr::mutate a few times in that I can't figure out how to create new columns based on a function (eg, summing or anything else) that would create new columns based on all pairs of two input sets of columns. A partial demonstration is below:

#Input data
set.seed(100)
in_dat <- tibble(x1 = sample(x = c(1:10, NA_real_), size = 1000, replace = TRUE),
                 x2 = sample(x = c(1:10, NA_real_), size = 1000, replace = TRUE),
                 x3 = sample(x = c(1:10, NA_real_), size = 1000, replace = TRUE),
                 x4 = sample(x = c(1:10, NA_real_), size = 1000, replace = TRUE),
                 y1 = sample(x = c(1, 0, NA_real_), size = 1000, replace = TRUE),
                 y2 = sample(x = c(1, 0, NA_real_), size = 1000, replace = TRUE),
                 y3 = sample(x = c(1, 0, NA_real_), size = 1000, replace = TRUE),
                 y4 = sample(x = c(1, 0, NA_real_), size = 1000, replace = TRUE),
                 y5 = sample(x = c(1, 0, NA_real_), size = 1000, replace = TRUE),
                 y6 = sample(x = c(1, 0, NA_real_), size = 1000, replace = TRUE))

#Output data with 1 column pair; all pairs between x and y should be computed
out_dat_1col <- in_dat %>% 
  mutate(miss_x1y1 = if_else(is.na(x1) & is.na(y1), TRUE, FALSE))

This checks to see of pairs of x and y columns both have missing values and marks TRUE in the new column. This is only one pair though, and I'd like a way to do this for all pairs between x and y columns other than manually coding each of them in their own mutate line. I think purrr should be able to accomplish this, but I haven't figured out the proper syntax with the map variants or possibly reduce as well. I'm currently getting an error from both map2_dfc (to append the new columns on to the existing columns with bind_cols ) and reduce2 that .x (x variables) and .y (y variables) are not of consistent length, and I'm not sure how to circumvent this. Any thoughts are much appreciated.

#Produces error
out_dat <- in_dat %>% 
  bind_cols(map2_dfc(
    .x = in_dat %>% select(starts_with('x')),
    .y = in_dat %>% select(starts_with('y')),
    .f = ~if_else(is.na(.x) & is.na(.y), TRUE, FALSE)
  ))

Error: Mapped vectors must have consistent lengths:
* `.x` has length 4
* `.y` has length 6

Here's a short base R way to create the data frame using lapply , sapply and mapply :

all_cols <- lapply(in_dat, function(y) sapply(in_dat, function(x) is.na(y) & is.na(x)))
all_cols <- mapply(function(x, y) {colnames(x) <- paste(y, colnames(x), sep = "_"); x}, 
                   all_cols, names(all_cols), SIMPLIFY = FALSE)
df <- as_tibble(cbind(in_dat, do.call(cbind, all_cols)))
df
#> # A tibble: 1,000 x 110
#>       x1    x2    x3    x4    y1    y2    y3    y4    y5    y6 x1_x1 x1_x2 x1_x3 x1_x4
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl>
#>  1     3     7     2     5     1     1     0     1     0    NA FALSE FALSE FALSE FALSE 
#>  2     7     5    10     3    NA     0    NA    NA     0    NA FALSE FALSE FALSE FALSE
#>  3     3     3     3     7     1     1    NA     1     1     1 FALSE FALSE FALSE FALSE
#>  4     7     3     1     8     1    NA     1     0    NA     1 FALSE FALSE FALSE FALSE 
#>  5     5     2    10     7     0    NA    NA     0    NA     1 FALSE FALSE FALSE FALSE 
#>  6     7     8    10     8    NA     1     1     1     1     1 FALSE FALSE FALSE FALSE 
#>  7    10     8     3     5     0     1    NA     1     1     1 FALSE FALSE FALSE FALSE 
#>  8     1    10     5    10     1    NA    NA     0     1     1 FALSE FALSE FALSE FALSE
#>  9     7     2     5     9    NA     0     0    NA     1     1 FALSE FALSE FALSE FALSE
#> 10     8     9     1     4     1    NA    NA     1    NA     0 FALSE FALSE FALSE FALSE
#> # ... with 990 more rows, and 96 more variables

The only problem being that you have also checked each row against itself, so to remove them you could do something like this:

df <- df[sapply(strsplit(names(df), "_"), function(x) {!any(duplicated(x))})]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM