R: Delete rows where one column is a substring of another
I have a data frame that looks like this:
c1 c2
fish fishing
dog tomato
cat loop
horse horse
I would now like to delete every row where c1 == c2, as well as every row where c1 is a substring of c2 or vice versa. In my example, horse == horse and 'fish' is a substring of 'fishing'. I know about the grepl function, e.g. df[!grepl(df$c1, df$c2),].
However, this solution does not account for substrings. Maybe there is a solution where I can use df[!grepl("STRING", df$c2),] for every row, so that "STRING" equals the value of df$c1?
Thanks in advance!
Using tidyverse:
library(tidyverse)
df %>%
filter(!str_detect(c2, c1), !str_detect(c1, c2))
Output:
c1 c2
1: dog tomato
2: cat loop
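One caveat with the filter above: str_detect treats its pattern argument as a regular expression, so values containing characters such as "." or "(" would not be matched literally. A minimal sketch of a literal-matching variant follows, assuming the dplyr and stringr packages are installed. The data frame is a reconstruction of the example from the question, and wrapping the pattern in fixed() is a suggested adjustment, not part of the original answer.

```r
library(dplyr)
library(stringr)

# Reconstruction of the example data from the question (assumed to be plain character columns)
df <- data.frame(
  c1 = c("fish", "dog", "cat", "horse"),
  c2 = c("fishing", "tomato", "loop", "horse"),
  stringsAsFactors = FALSE
)

# fixed() tells stringr to compare the text literally instead of as a regex.
# Both arguments are vectors of equal length, so the comparison is done row by row.
result <- df %>%
  filter(!str_detect(c2, fixed(c1)), !str_detect(c1, fixed(c2)))

print(result)
```

The first condition drops rows where c1 occurs inside c2 and the second drops rows where c2 occurs inside c1. Identical values are caught by either one, so no separate c1 == c2 check is needed.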
This will work no matter which columns have similar words (not just like in your specific example).
dat[!with(dat, mapply(grepl, c1, c2)) & !with(dat, mapply(grepl, c2, c1)),]
# c1 c2
# 2 dog tomato
# 3 cat loop
grepl only works on one pattern at a time: if you pass multiple patterns (i.e., all of dat$c1), it uses only the first one and issues a warning, so you do not get the intended output.
grepl(dat$c1, dat$c2)
# Warning in grepl(dat$c1, dat$c2) :
# argument 'pattern' has length > 1 and only the first element will be used
# [1] TRUE FALSE FALSE FALSE
We vectorize it (with mapply) so that grepl runs once for each c1 / c2 pair.
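The row-wise approach can be sketched as a small self-contained script. The data frame below is a reconstruction of the example from the question, and contains() is a hypothetical helper name introduced here for readability. Passing fixed = TRUE through MoreArgs is an added assumption that the values should be compared as literal text, not as regular expressions.

```r
# Reconstruction of the example data from the question
dat <- data.frame(
  c1 = c("fish", "dog", "cat", "horse"),
  c2 = c("fishing", "tomato", "loop", "horse"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: element i is TRUE when needle[i] occurs literally inside haystack[i].
# mapply calls grepl once per pair; fixed = TRUE disables regex interpretation.
contains <- function(needle, haystack) {
  unname(mapply(grepl, needle, haystack, MoreArgs = list(fixed = TRUE)))
}

# Keep a row only if neither column is contained in the other
keep <- !contains(dat$c1, dat$c2) & !contains(dat$c2, dat$c1)
cleaned <- dat[keep, ]

print(cleaned)
```

Without fixed = TRUE, a value such as "a.b" would be read as a pattern in which "." matches any character, and a value such as "(" would raise an error because it is not a valid regular expression.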