Input dt
-
dt <- data.frame(a_check=c(1,2,1,1,2),
b_check=c(0,1,NA,1,15),
c_check=c(1,0,0,1,NA),
d_check=c(1,1,1,0,0),
e_check=c(1,NA,0,1,1))
Validation Lists-
valid_values <- list(a_check= c(1,2,3), b_check= c(0,1),c_check=c(0,1,2),d_check="possitive integer",e_check="positive integer")
required_list <- list(a_check= 1, b_check= 1,c_check=0,d_check=1,e_check=0)
col_type_list <- list(a_check= "factor", b_check= "factor",c_check="continuous",d_check="continuous",e_check="continuous")
Question-
I am trying to obtain below desired output, by using multiple ifelse
conditions as below-
variable
is required in required_list
and dt
contains NA
for that column than its should give error
(variable cannot be NA because it is required).variable
is continuous
in col_type_list
than it should contain only positive values in dt
else (variable must be a positive integer)variable
is factor
in col_type_list
than its should match the value in valid_value
list else (variable must be one of the following values). I am able to obtain result using nested for loops
but it is not efficient at all for large data set.
My Code-
param_names <- colnames(dt)
error_msg <- list()
error <- list()
for(i in 1:nrow(dt)){
for(j in 1:length(param_names))
{
if(get(param_names[j],required_list) %in% 1 & is.na(as.numeric(unlist(dt[param_names[j]]))[i]) == TRUE)
{
error_msg[j] <- paste0(toupper(param_names[j]), " cannot be NA because it is required")
}
## continuous variable check
else if(get(param_names[j],col_type_list)=="continuous"){
if (is.na(as.numeric(unlist(dt[param_names[j]]))[i]) | as.numeric(unlist(dt[param_names[j]]))[i] < 0) {
error_msg[j] <- paste0(toupper(param_names[j]), " must be a positive integer")
} else {
error_msg[j] <- NA
}
} else {
## factor variable check
if(!(as.numeric(unlist(dt[param_names[j]]))[i] %in% get(param_names[j],valid_values))){
error_msg[j] <- paste0(toupper(param_names[j]), " must be one of the following values ", paste(get(param_names[j],valid_values), collapse = '-'))
} else {
error_msg[j] <- NA
}
}
} ## end of inner for loop
error[i] <- paste(unlist(error_msg),collapse = " & ")
}## end of inner f
final_error <- unlist(error)
setDT(dt)
dt[,error := final_error]
dt[,error := gsub("NA & | NA \\s+ &", "\\1", error)]
dt[,error := gsub("& \\s+ NA | & NA", "\\1", error)]
Output-
> dt
a_check b_check c_check d_check e_check error
1: 1 0 1 1 1 NA
2: 2 1 0 1 NA E_CHECK must be a positive integer
3: 1 NA 0 1 0 B_CHECK cannot be NA
4: 1 1 1 0 1 NA
5: 2 15 NA 0 1 B_CHECK must be one of the following values 0-1 & C_CHECK must be a positive integer
Note - I know it can be achieved using something similar solution by @Jav
dt[, error := lapply(param_names, function(x) {
((get(x, dt) %in% get(x, valid_values))) %>%
ifelse(., " ", paste(x, "should have valid values like -", paste(get(x, valid_values), collapse = " ")))
}) %>% Reduce(paste, .)]
But, I am struggling to use multiple ifelse
condition using above solution. I am looking for efficient and clean solution to avoid for loops
. Any other method would also work.
You can get rid of the nested loops, but there will still be a lot of code to write. The cleanest way, in my opinion, is to write a custom function that defines how the logic is applied:
library(tidyverse)
check_col_validity <- function(col, name) {
r_error <- rep(NA, length(col))
# is required?
if (required_list[name] == 1) {
msg <- paste(toupper(name), "is required")
r_error <- ifelse(is.na(col), msg, NA)
}
# is continuous?
if (col_type_list[name] == "continuous") {
msg <- paste(toupper(name), "must be positive")
new_error <- ifelse(col < 0 | is.na(col), msg, NA)
error <- ifelse(is.na(r_error), new_error, paste(r_error, new_error, sep = " & "))
}
# is in valid range?
if (col_type_list[name] == "factor") {
valid_range <- valid_values[[name]]
msg <- paste(toupper(name), "must be one of", paste(valid_range, collapse = ", "))
new_error <- ifelse(col %in% valid_range, NA, msg)
error <- ifelse(is.na(r_error), new_error, r_error)
}
return(error)
}
This is a lot like the logic you had. The difference is in how it is applied to the data:
dt$error <- dt[, 1:5] %>%
purrr::imap_dfc(check_col_validity) %>%
t() %>%
as_tibble() %>%
purrr::map_chr(paste, collapse = " & ") %>%
stringr::str_remove_all("NA & ") %>%
stringr::str_remove_all(" & NA")
The function is applied to every column using purrr::imap
. The result is transposed and pasted together, after which the final step is to remove the ugly NA strings. It delivers the expected result, and I hope the code is clearer to the eye.
The main part about this process is how imap
works. It is an apply-type operation over a list, but it passes the names of the list elements as the second parameter to the function. This means you can write a custom function that is applied to every column of a dataframe, and add a second parameter to the function to which imap
will pass the column's name. Once you have both the column's data and name available inside the function, the function becomes a lot easier to write.
The custom function returns the error messages that apply to that column. This means you get a data frame with the same dimensions as your original dataset. You can then transpose this data frame and paste the results for each column together to get 1 message per row.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.