Two datasets: how to check whether the values in a column of one dataset are contained in a column of another dataset in R?

Question

I have two datasets, data1 and data2. Note that data1 contains 300 rows and data2 contains 5000 rows. Both datasets have a column named x2 (see the sample data below). The x2 column of data2 contains 5000 car names, while the x2 column of data1 contains only 300 car names.
How can I check whether the x2 values of data1 are contained in the x2 column of data2?

data1 <- data.frame(x1 = c(1, 3, 7, 7, 4, 7),
                    x2 = c("a 1-metha (akD)", "methal methal", "methy", "3-[3-(methy)prox",
                           "3-carbon (C:H)", "z"),
                    x3 = 10:15)

data2 <- data.frame(x1 = c(1, 3, 7, 7, 4, 7),
                    x2 = c("a 1-metha (akD)|a 1-metha akaikedenioyl|a 1-m(akD)",
                           "methal methal|X.methal methal|methal (22)",
                           "methy", "3-[3-(methy)prox", "3-carbon (C:H)", "y"),
                    x3 = 20:25)

I have just started using R. I tried the grep function, and I would like to automate the check rather than doing it value by value.

library(stringr)  # provides str_extract()

# Attempt for the first value of data1$x2 only
matchedValue <- grep(str_extract(data1$x2[1], "([[:alnum:][:punct:][:blank:]]+)"),
                     str_extract(data2$x2, "([[:alnum:][:punct:][:blank:]]+)"),
                     ignore.case = TRUE)

For example, I want to know whether a 1-metha (akD) (see column x2 of data1) is also present in x2 of data2, and I want to do this automatically for all 300 rows of data1.
How can I do this?

Answer 1

library(tidyverse)

data1 %>% 
  mutate(in_data2 = x2 %in% str_extract(data2$x2, "^[^\\|]*"))

# A tibble: 6 × 4
     x1 x2                  x3 in_data2
  <dbl> <chr>            <int> <lgl>   
1     1 a 1-metha (akD)     10 TRUE    
2     3 methal methal       11 TRUE    
3     7 methy               12 TRUE    
4     7 3-[3-(methy)prox    13 TRUE    
5     4 3-carbon (C:H)      14 TRUE    
6     7 z                   15 FALSE
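
Note that the pattern "^[^\\|]*" keeps only the text before the first | in each data2 cell, so a name that appears only as a later alternative would be reported as FALSE. The following is a minimal base R sketch of a variant that pools every alternative. It assumes that | separates alternative names within a cell, which is a guess based on the sample data, and the name all_names is introduced here for illustration only.

```r
data1 <- data.frame(x1 = c(1, 3, 7, 7, 4, 7),
                    x2 = c("a 1-metha (akD)", "methal methal", "methy",
                           "3-[3-(methy)prox", "3-carbon (C:H)", "z"),
                    x3 = 10:15)
data2 <- data.frame(x1 = c(1, 3, 7, 7, 4, 7),
                    x2 = c("a 1-metha (akD)|a 1-metha akaikedenioyl|a 1-m(akD)",
                           "methal methal|X.methal methal|methal (22)",
                           "methy", "3-[3-(methy)prox", "3-carbon (C:H)", "y"),
                    x3 = 20:25)

# Split every data2 cell on a literal "|" and pool all alternatives
# (assumption: "|" is a separator between alternative names)
all_names <- unique(unlist(strsplit(data2$x2, "|", fixed = TRUE)))

# Exact, whole-name membership test for each row of data1
data1$in_data2 <- data1$x2 %in% all_names
data1$in_data2
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
```

Because %in% compares whole strings, this does not treat "methy" as a match for "3-[3-(methy)prox"; it only matches complete alternatives.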

Answer 2

We could use str_detect with fixed(); see https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html#fixed-matches

library(dplyr)
library(stringr)

data1 %>% 
  mutate(check = str_detect(x2, fixed(data2$x2)))

  x1               x2 x3 check
1  1  a 1-metha (akD) 10 FALSE
2  3    methal methal 11 FALSE
3  7            methy 12  TRUE
4  7 3-[3-(methy)prox 13  TRUE
5  4   3-carbon (C:H) 14  TRUE
6  7                z 15 FALSE
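
One thing to be aware of: str_detect is vectorised over both string and pattern, so the call above compares row i of data1 with row i of data2 only, and asks whether the data1 value contains the data2 value. That is why rows 1 and 2 come out FALSE (the longer data2 strings cannot be contained in the shorter data1 names). A possible sketch that instead checks each data1 name against every data2 cell, assuming a literal substring match is what is wanted:

```r
library(stringr)

data1 <- data.frame(x1 = c(1, 3, 7, 7, 4, 7),
                    x2 = c("a 1-metha (akD)", "methal methal", "methy",
                           "3-[3-(methy)prox", "3-carbon (C:H)", "z"),
                    x3 = 10:15)
data2 <- data.frame(x1 = c(1, 3, 7, 7, 4, 7),
                    x2 = c("a 1-metha (akD)|a 1-metha akaikedenioyl|a 1-m(akD)",
                           "methal methal|X.methal methal|methal (22)",
                           "methy", "3-[3-(methy)prox", "3-carbon (C:H)", "y"),
                    x3 = 20:25)

# For each data1 name, ask whether ANY data2 cell contains it as a
# literal (non-regex) substring; fixed() keeps "(", "[" and ":" literal
data1$check <- vapply(
  data1$x2,
  function(name) any(str_detect(data2$x2, fixed(name))),
  logical(1),
  USE.NAMES = FALSE
)
data1$check
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
```

This also works when the two data frames have different numbers of rows, as in the 300 versus 5000 case described in the question. Since it is a substring test, a short name such as "methy" would also match inside a longer data2 entry.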

Answer 3

You can use colSums on the matrix returned by sapply to check each row of data1 against the entire x2 column of data2.

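# One column per data1 value, one row per data2 value; a column sum > 0 means at least one match
data1$isin <- (colSums(sapply(data1$x2, \(x) grepl(x, data2$x2, fixed = TRUE))) > 0)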

  x1               x2 x3  isin
1  1  a 1-metha (akD) 10  TRUE
2  3    methal methal 11  TRUE
3  7            methy 12  TRUE
4  7 3-[3-(methy)prox 13  TRUE
5  4   3-carbon (C:H) 14  TRUE
6  7                z 15 FALSE

3 answers

Answer 1
0 2022-09-08 19:09:44

Answer 2
0 2022-09-08 19:19:37

Answer 3
0 2022-09-08 19:56:16
