
Compare cells and take NA as positive match

I have data like this:

> dput(daata)
structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L, 
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple", 
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L, 
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange", 
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(NA, 
NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"), 
    P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair", 
    "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"), 
    P2_location_subacon = structure(c(1L, 1L, 1L, 1L, NA, NA, 
    NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge", 
    "Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L, 
    3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed", 
    "Table,Shelf,Fridge"), class = "factor")), .Names = c("P1", 
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon", 
"P2_location_all_predictors"), row.names = c(NA, -20L), class = "data.frame")

and I use the function below to compare cells (location):

daata$comp_subacon[mapply(setequal,strsplit(daata$P1_location_subacon, ","), strsplit(daata$P2_location_subacon, ","))] <- 1

What does this function do?

It checks whether the text in the two cells is the same and, if so, puts 1 in the new column. The problem is that for some of the fruits/veggies the location is unknown, and in that case I would like to treat it as a positive match, i.e. also put 1 in the new column. An unknown location is marked as NA. Do you have any idea how to modify the function I am currently using? I can use a different one as well.

EDIT: Result of my first attempt (rows where either location is NA get 0 instead of 1):

> dput(daata_after_fun)
structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L, 
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple", 
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L, 
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange", 
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(NA, 
NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"), 
    P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair", 
    "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"), 
    P2_location_subacon = structure(c(1L, 1L, 1L, 1L, NA, NA, 
    NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge", 
    "Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L, 
    3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed", 
    "Table,Shelf,Fridge"), class = "factor"), comp_subacon = c(0, 
    0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("P1", 
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon", 
"P2_location_all_predictors", "comp_subacon"), row.names = c(NA, 
-20L), class = "data.frame")

You can define a function:

eq_or_na <- function( a , b ) (!is.na(a) & !is.na(b) & a==b) | (is.na(a) | is.na(b))

then the following should work:

daata$comp_subacon[eq_or_na(as.character(daata$P1_location_subacon), 
                            as.character(daata$P2_location_subacon))] <- 1
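
As a quick sanity check, the helper can be tried on a few hand-made values. The vectors `p1` and `p2` below are made up for illustration and are not part of the question's data, so treat this as a sketch of the expected behaviour rather than a test on the real table.

```r
# Same definition as above, repeated so the snippet runs on its own
eq_or_na <- function(a, b) (!is.na(a) & !is.na(b) & a == b) | (is.na(a) | is.na(b))

# Hypothetical toy inputs (assumption: not taken from daata)
p1 <- c("Fridge", NA, "Table", NA)
p2 <- c("Fridge", "Fridge", "Fridge", NA)

# Equal values match, and any pair containing an NA also counts as a match
res <- eq_or_na(p1, p2)
print(res)              # TRUE TRUE FALSE TRUE

# Converting the logical vector gives 1/0 flags instead of 1/NA
print(as.integer(res))  # 1 1 0 1
```

Assigning `as.integer(...)` to the new column in one go avoids leaving NA in the rows that do not match, which is what the subset assignment `[...] <- 1` does.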

If a cell holds a set of values, as in your variable P1_location_all_predictors, you can do:

seteq_or_na <- function( a , b ) (!any(is.na(a)) & !any(is.na(b)) & setequal(a, b)) | (all(is.na(a)) | all(is.na(b)))
daata$comp_subacon[mapply(seteq_or_na, 
                          strsplit(as.character(daata$P1_location_subacon), ","), 
                          strsplit(as.character(daata$P2_location_subacon), ","))] <- 1

For example, with P1_location_all_predictors and P2_location_all_predictors you can define the new variable directly:

daata$comp_subacon_2 <- +(mapply(seteq_or_na, 
                                 strsplit(as.character(daata$P1_location_all_predictors), ","), 
                                 strsplit(as.character(daata$P2_location_all_predictors), ",")))
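
To see how the set comparison treats element order and missing values, here is a small sketch on invented strings. The vectors `x` and `y` are hypothetical examples, not rows from the data above.

```r
# Same definition as above, repeated so the snippet runs on its own
seteq_or_na <- function(a, b) (!any(is.na(a)) & !any(is.na(b)) & setequal(a, b)) | (all(is.na(a)) | all(is.na(b)))

# Hypothetical toy inputs (assumption: not taken from daata)
x <- c("Table,Shelf,Fridge", "Shelf,Fridge",       NA,             "Shelf,Fridge,Bed")
y <- c("Fridge,Table,Shelf", "Table,Shelf,Fridge", "Shelf,Fridge", "Shelf,Fridge,Bed")

# strsplit() turns each cell into a character vector; an NA cell stays NA
flags <- +(mapply(seteq_or_na, strsplit(x, ","), strsplit(y, ",")))
print(flags)  # 1 0 1 1
```

The first pair matches although the order differs, the second differs by one element, and the third matches only because one side is NA. The unary `+` converts the logical result of `mapply` to integers.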

EDIT

If you want to know whether the two sets share at least one location, you can define a new function:

inter_or_na <- function( a , b ) (!any(is.na(a)) & !any(is.na(b)) & length(intersect(a, b)) > 0) | (all(is.na(a)) | all(is.na(b)))

And then apply it to your two columns:

daata$comp_subacon_3 <- +(mapply(inter_or_na, 
                                 strsplit(as.character(daata$P1_location_all_predictors), ","), 
                                 strsplit(as.character(daata$P2_location_all_predictors), ",")))
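
A minimal sketch of the overlap check on invented strings follows. The vectors `x` and `y` are hypothetical, and the intersection length is compared with `> 0` explicitly here, which behaves the same as relying on the implicit coercion to logical.

```r
# Overlap-or-NA helper, written out so the snippet runs on its own
inter_or_na <- function(a, b) (!any(is.na(a)) & !any(is.na(b)) & length(intersect(a, b)) > 0) | (all(is.na(a)) | all(is.na(b)))

# Hypothetical toy inputs (assumption: not taken from daata)
x <- c("Table,Shelf",  "Desk,Bag",     NA)
y <- c("Shelf,Fridge", "Shelf,Fridge", "Bed")

# 1 when the two sets share at least one location or one side is NA
overlap <- +(mapply(inter_or_na, strsplit(x, ","), strsplit(y, ",")))
print(overlap)  # 1 0 1
```

The first pair shares "Shelf", the second has nothing in common, and the third is accepted because the location on one side is unknown.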

