
Two datasets: How to check if the values of a column of a dataset are contained in another column of another dataset in R?

I have two datasets, data1 and data2. data1 contains 300 rows and data2 contains 5000 rows. Both datasets have a column named x2 (see the example below). The x2 column of data2 holds 5000 car names, and the x2 column of data1 holds just 300 car names.
How can I check whether the x2 values of data1 are contained in the x2 column of data2?

data1 <- data.frame(x1 = c(1, 3, 7, 7, 4, 7),
                    x2 = c("a 1-metha (akD)", "methal methal", "methy", "3-[3-(methy)prox",
                           "3-carbon (C:H)", "z"),
                    x3 = 10:15)

data2 <- data.frame(x1 = c(1, 3, 7, 7, 4, 7),
                    x2 = c("a 1-metha (akD)|a 1-metha akaikedenioyl|a 1-m(akD)",
                           "methal methal|X.methal methal|methal (22)",
                           "methy", "3-[3-(methy)prox",
                           "3-carbon (C:H)", "y"),
                    x3 = 20:25)

I have just started using R. I tried the grep function, and I would like to automate the check rather than doing it one value at a time.

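library(stringr)  # str_extract() comes from stringr

# Attempt for the first value only; x2 is the column being compared
matchedValue <- grep(str_extract(data1$x2[1], "([[:alnum:][:punct:][:blank:]]+)"),
        str_extract(data2$x2, "([[:alnum:][:punct:][:blank:]]+)"),
        ignore.case = TRUE)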

For example, I want to know whether a 1-metha (akD) (see column x2 of data1) is also present in x2 of data2, and I want to do this automatically for all 300 rows of data1.
How can I do this?

library(tidyverse)

data1 %>% 
  mutate(in_data2 = x2 %in% str_extract(data2$x2, "^[^\\|]*"))

# A tibble: 6 × 4
     x1 x2                  x3 in_data2
  <dbl> <chr>            <int> <lgl>   
1     1 a 1-metha (akD)     10 TRUE    
2     3 methal methal       11 TRUE    
3     7 methy               12 TRUE    
4     7 3-[3-(methy)prox    13 TRUE    
5     4 3-carbon (C:H)      14 TRUE    
6     7 z                   15 FALSE 
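The pattern above keeps only the text before the first | in each data2 value. If a name should also count as present when it appears as one of the later |-separated alternatives, a minimal base R sketch could look like the following. It assumes that | is purely a separator in data2$x2; the names alternatives and in_data2_any are made up for this illustration.

```r
# Example data from the question (x2 columns only)
data1 <- data.frame(
  x2 = c("a 1-metha (akD)", "methal methal", "methy",
         "3-[3-(methy)prox", "3-carbon (C:H)", "z"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  x2 = c("a 1-metha (akD)|a 1-metha akaikedenioyl|a 1-m(akD)",
         "methal methal|X.methal methal|methal (22)",
         "methy", "3-[3-(methy)prox", "3-carbon (C:H)", "y"),
  stringsAsFactors = FALSE
)

# Split every data2 value on the literal "|" and pool all alternatives
alternatives <- unique(unlist(strsplit(data2$x2, "|", fixed = TRUE)))

# Exact match of each data1 name against any alternative
data1$in_data2_any <- data1$x2 %in% alternatives
data1$in_data2_any
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
```

Because %in% does exact matching, punctuation such as ( or [ in the names needs no escaping.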

We could use str_detect() with fixed(); see https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html#fixed-matches

library(dplyr)
library(stringr)

data1 %>% 
  mutate(check = str_detect(x2, fixed(data2$x2)))
  x1               x2 x3 check
1  1  a 1-metha (akD) 10 FALSE
2  3    methal methal 11 FALSE
3  7            methy 12  TRUE
4  7 3-[3-(methy)prox 13  TRUE
5  4   3-carbon (C:H) 14  TRUE
6  7                z 15 FALSE
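One thing to be aware of: str_detect() is vectorised over both arguments, so the call above compares row i of data1 with row i of data2 and asks whether the data1 string contains the data2 string. That is why the first two rows come out FALSE. The sketch below reproduces that behaviour in base R and shows the reversed direction for comparison. The helper contains_fixed and the variable names are hypothetical, and both versions still compare only matching row positions.

```r
# x2 values from the question
d1 <- c("a 1-metha (akD)", "methal methal", "methy",
        "3-[3-(methy)prox", "3-carbon (C:H)", "z")
d2 <- c("a 1-metha (akD)|a 1-metha akaikedenioyl|a 1-m(akD)",
        "methal methal|X.methal methal|methal (22)",
        "methy", "3-[3-(methy)prox", "3-carbon (C:H)", "y")

# Hypothetical helper: does `string` contain `pattern` as a literal substring?
contains_fixed <- function(string, pattern) grepl(pattern, string, fixed = TRUE)

# Same direction as the str_detect() call: d1[i] contains d2[i]?
pairwise <- mapply(contains_fixed, d1, d2, USE.NAMES = FALSE)
pairwise
# [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE

# Reversed direction: d2[i] contains d1[i]?
reversed <- mapply(contains_fixed, d2, d1, USE.NAMES = FALSE)
reversed
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
```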

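You can use colSums on the matrix returned by sapply to check each row of data1 against the entire x2 column of data2. Each column of that matrix corresponds to one data1 value, so a column sum above zero means the value was found somewhere in data2.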

data1$isin <- colSums(sapply(data1$x2, \(x) grepl(x, data2$x2, fixed = TRUE))) > 0
data1
  x1               x2 x3  isin
1  1  a 1-metha (akD) 10  TRUE
2  3    methal methal 11  TRUE
3  7            methy 12  TRUE
4  7 3-[3-(methy)prox 13  TRUE
5  4   3-carbon (C:H) 14  TRUE
6  7                z 15 FALSE
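The attempt in the question used ignore.case = TRUE, but grepl() ignores that argument (with a warning) when fixed = TRUE. If case-insensitive matching is wanted, one possible sketch is to lower-case both sides first and keep the literal matching. The mixed-case sample values and the names needles, haystack and isin_ci are assumptions for illustration only.

```r
# Mixed-case sample names (illustrative, not from the original data)
data1_x2 <- c("A 1-Metha (AKD)", "methy", "z")
data2_x2 <- c("a 1-metha (akD)|a 1-metha akaikedenioyl|a 1-m(akD)",
              "methal methal|X.methal methal|methal (22)",
              "methy", "y")

# Normalise case on both sides, then match literally
needles  <- tolower(data1_x2)
haystack <- tolower(data2_x2)

# For each needle: is it a substring of at least one data2 value?
isin_ci <- vapply(
  needles,
  function(n) any(grepl(n, haystack, fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)
isin_ci
# [1]  TRUE  TRUE FALSE
```

vapply() with any() returns one logical per data1 value directly, without building the full match matrix.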

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.

 