Compare two vectors within a data frame with %in% with R

I have the following data:

T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )

  Col1   Col2
1    a  a,b,c
2    b aa,c,d
3   aa  c,d,e
4    d  d,f,g

I want to select the rows where a given column contains one of the values in the vector c("a", "e", "g"):

library(dplyr)

T1 %>% filter(Col1 %in% c("a", "e", "g"))

This returns:

  Col1  Col2
1    a a,b,c

That is correct, but what if I want to compare two vectors? For example:

With unlist and strsplit, I turn the value of each row into a character vector and try to compare it with the reference vector, to select the rows that contain any of the values:

unlist(strsplit(T1$Col2[1],","))

[1] "a" "b" "c"

T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))

It gives me an error:

Error in filter():
! Problem while computing ..1 = unlist(strsplit(Col2, ",")) %in% c("a", "e", "g").
✖ Input ..1 must be of size 4 or 1, not size 12.
Run rlang::last_error() to see where the error occurred.
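The size mismatch reported in the error can be reproduced outside filter(). The following is a minimal base R sketch (the names parts, flat and hits are chosen here for illustration). strsplit returns one character vector per row, and unlist flattens them into a single vector, so the link between values and rows is lost.

```r
T1 <- data.frame(
  Col1 = c("a", "b", "aa", "d"),
  Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g"),
  stringsAsFactors = FALSE
)

parts <- strsplit(T1$Col2, ",")   # a list with one character vector per row
flat  <- unlist(parts)            # a single flattened character vector
hits  <- flat %in% c("a", "e", "g")

length(parts)  # number of rows
length(flat)   # number of individual values across all rows
length(hits)   # same length as flat, not the number of rows
```

filter() needs one TRUE/FALSE per row (or a single value), so a result with one element per split value cannot be used directly.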

I can do it like this:

T1[grep(c("a|e|g"), T1$Col2),]

  Col1   Col2
1    a  a,b,c
2    b aa,c,d
3   aa  c,d,e
4    d  d,f,g

But that's wrong: row 3 (aa / c,d,e) shouldn't be there, because it's not a, it's aa.

To search for "a" alone, you would have to do:

T1[grep(c("\\<a\\>"), T1$Col2),]

I think I will end up making a mistake with this form. I would feel safer if I could do it by comparing vector with vector:

T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))

Edited answer

You can use \\b, the regular-expression word boundary. The | between the alternatives works like an "or" operation, and the boundaries on either side apply to each alternative. You can use the following code:

T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
library(dplyr)
library(stringr)
T1 %>% 
  filter(grepl("\\b(a|e|g)\\b", Col2))
#>   Col1  Col2
#> 1    a a,b,c
#> 2   aa c,d,e
#> 3    d d,f,g

Created on 2022-07-16 by the reprex package (v2.0.1)

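Note: in an ordinary R string literal the backslash has to be escaped, so the word boundary is written \\b. With raw strings (R 4.0+) you can write it with a single backslash, as in r"(\b(a|e|g)\b)".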
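If the reference values live in a vector, the same word-boundary pattern can be built from it instead of being typed by hand. This is a minimal base R sketch; ref, pattern and keep are names chosen here for illustration, and it assumes the values contain no regex metacharacters (such as . or +), which would otherwise need escaping.

```r
T1 <- data.frame(
  Col1 = c("a", "b", "aa", "d"),
  Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g"),
  stringsAsFactors = FALSE
)
ref <- c("a", "e", "g")

# Build the pattern \b(a|e|g)\b from the reference vector
pattern <- paste0("\\b(", paste(ref, collapse = "|"), ")\\b")

# One TRUE/FALSE per row: does Col2 contain any reference value as a whole word?
keep <- grepl(pattern, T1$Col2)
T1[keep, ]

# The raw-string spelling (R >= 4.0) is the same pattern without doubled backslashes
identical(pattern, r"(\b(a|e|g)\b)")
```

The row with "aa,c,d" is dropped because neither "a" in "aa" sits between two word boundaries.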

Old answer

It returns all rows because you are checking whether any of the strings occurs anywhere in Col2. In row 3, for example, "e" occurs, which is one of the strings, so that row is returned as well. You could also use str_detect like this:

library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%  
  # one TRUE/FALSE per row; wrapping this in any() would collapse it to a single value
  filter(str_detect(Col2, paste0(vector, collapse="|")))
#>   Col1   Col2
#> 1    a  a,b,c
#> 2    b aa,c,d
#> 3   aa  c,d,e
#> 4    d  d,f,g

Created on 2022-07-16 by the reprex package (v2.0.1)
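To see which part of each Col2 value triggers the match with the unanchored pattern, the first match per row can be extracted. This is a small base R sketch; first_hit is a name chosen here for illustration.

```r
T1 <- data.frame(
  Col1 = c("a", "b", "aa", "d"),
  Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g"),
  stringsAsFactors = FALSE
)
pattern <- "a|e|g"

# regexpr() locates the first match in each string;
# regmatches() extracts the matched text (every row matches here)
first_hit <- regmatches(T1$Col2, regexpr(pattern, T1$Col2))

data.frame(Col2 = T1$Col2, first_hit = first_hit)
```

Row 3 matches through "e" and row 4 through "g", while row 2 matches only because "aa" contains "a".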

If you want to check whether one of the strings exists in both columns, you can use the following code:

library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
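T1 %>% 
  # Note the argument order of str_detect(string, pattern): here the string is "a|e|g"
  # and each cell (.x) is used as the pattern, so a row is kept when one of its cells
  # is found inside "a|e|g".
  filter(Reduce(`|`, across(all_of(colnames(T1)), ~str_detect(paste0(vector, collapse="|"), .x))))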
#>   Col1  Col2
#> 1    a a,b,c

Created on 2022-07-16 by the reprex package (v2.0.1)

Another way to achieve this (using your original approach with strsplit) is to work rowwise() and sum the logical test, keeping the rows where at least one value matches.

T1 %>% 
  rowwise() %>% 
  filter(sum(unlist(strsplit(Col2,",")) %in% c("a","e","g")) >= 1)
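The same per-row test can also be written without rowwise(), by applying the %in% comparison to each split row separately. This is a minimal base R sketch; ref and keep are names chosen here for illustration.

```r
T1 <- data.frame(
  Col1 = c("a", "b", "aa", "d"),
  Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g"),
  stringsAsFactors = FALSE
)
ref <- c("a", "e", "g")

# Split each row on commas, then ask per row whether any piece is in ref.
# vapply() guarantees exactly one logical value per row.
keep <- vapply(
  strsplit(T1$Col2, ",", fixed = TRUE),
  function(pieces) any(pieces %in% ref),
  logical(1)
)

T1[keep, ]
```

Because the comparison is an exact match on whole pieces, "aa" is never confused with "a", and no regular expression is involved. The logical vector keep has one element per row, so it could also be used inside filter().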
