简体   繁体   中英

Replace multiple NA values with a correct values from a different set of columns

I have a dataframe with missing data for some rows. The correct data can be found in another set of columns. I want to replace the NAs with the correct data.

My data looks like this:

df <- data.frame(M_1=c(1,NA,3,NA,6),
           M_2=c(5,NA,3,NA,1),
           M_3=c(6,NA,2,NA,4),
           M_C_1=c(NA,2,NA,6,NA),
           M_C_2=c(NA,1,NA,4,NA),
           M_C_3=c(NA,7,NA,3,NA))
df
#   M_1 M_2 M_3 M_C_1 M_C_2 M_C_3
#1   1   5   6    NA    NA    NA
#2  NA  NA  NA     2     1     7
#3   3   3   2    NA    NA    NA
#4  NA  NA  NA     6     4     3
#5   6   1   4    NA    NA    NA

For all records, I either have a complete set of records for variables M_1 , M_2 , and M_3 or I have a complete set for variables M_C_1 , M_C_2 , and M_C_3 .

For each row that has NAs in the first set of variables (M_1:M_3), I would like to replace with the values from the second set of values (M_C_1:M_C_2). I dont need to retain the second set of values. So my desired data frame would look like:

df
#   M_1 M_2 M_3
#1   1   5   6    
#2   2   1   7
#3   3   3   2
#4   6   4   3     
#5   6   1   4

My real dataset contains many columns in this notation, so I need a general solutions (ie, I dont want to refer to each column individually).

I would like to do this with dplyr if possible.

You could use map2 + coalesce :

library(dplyr)
library(purrr)

map2_dfc(select(df, 1:3), select(df, 4:6), coalesce)

# # A tibble: 5 × 3
#     M_1   M_2   M_3
#   <dbl> <dbl> <dbl>
# 1     1     5     6
# 2     2     1     7
# 3     3     3     2
# 4     6     4     3
# 5     6     1     4

Here's a generic example that works if the columns have names that allow them to be identified and they are in the correct order.

library(dplyr)

df <- data.frame(M_1=c(1,NA,3,NA,6),
                 M_2=c(5,NA,3,NA,1),
                 M_3=c(6,NA,2,NA,4),
                 M_C_1=c(NA,2,NA,6,NA),
                 M_C_2=c(NA,1,NA,4,NA),
                 M_C_3=c(NA,7,NA,3,NA))

# make a temporary id so we can retain the order later
df <- df %>% mutate(temporary_id = 1:n())

# find the columns corresponding to the final data
# they are assumed to be of the form M_number
df_records <- 
    df %>% 
    select(matches('temporary_id|M_[0-9]+')) %>% 
    na.omit()

# find the extra columns with data to replace in the final data
# they are assumed to be of the form M_C_number
df_extra <- 
    df %>% 
    select(matches('temporary_id|M_C_[0-9]+')) %>% 
    na.omit()

# change the names of the extra columns to match the final data
# this only works if the columns are in the correct order in the original data frame
names(df_extra) <- names(df_records)

# bind the rows of the final and extra data, sort and remove the temporary id
final_df <- 
    df_records %>% 
    bind_rows(df_extra) %>% 
    arrange(temporary_id) %>% 
    select(-temporary_id)

final_df
#  M_1 M_2 M_3
#1   1   5   6
#2   2   1   7
#3   3   3   2
#4   6   4   3
#5   6   1   4

If they are not in the required order, some sorting could be done but I'll leave that for now.

Here is another option with dplyover

library(dplyover)
library(stringr)
df %>%
   transmute(across2(1:3, 4:6, coalesce, 
         .names_fn = ~ str_remove(.x, "(?<=\\d)_.*") ))
  M_1 M_2 M_3
1   1   5   6
2   2   1   7
3   3   3   2
4   6   4   3
5   6   1   4

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM