

Find matching patterns from list of patterns using grepl

I used grepl to check whether a string contains any of the patterns from a set of patterns (I used '|' to separate the patterns). Reversing the search (swapping the string and the patterns) didn't help. How can I identify which of the patterns matched?

Additional information: This can be solved by writing a loop, but that is very time-consuming because my set has more than 100,000 strings. Can it be optimized?

E.g., let the string be a <- "Hello"

pattern <- c("ll", "lo", "hl")

pattern1 <- paste(pattern, collapse="|") # "ll|lo|hl"

grepl(a, pattern=pattern1) # returns TRUE

grepl(pattern, pattern=a) # returns FALSE 'n' times - n is 3 here

You are looking for str_detect from the stringr package:


library(stringr)
str_detect(a, pattern) # one logical per pattern: TRUE TRUE FALSE for a <- "Hello"
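If adding stringr is not an option, roughly the same check can be sketched in base R. This is only an illustration: it assumes the patterns are literal substrings (hence fixed = TRUE), and the names found and matched are made up for the example.

```r
a <- "Hello"
pattern <- c("ll", "lo", "hl")

# Test each pattern against the single string.
# vapply() names the result after the patterns because the input is a character vector.
found <- vapply(pattern, function(p) grepl(p, a, fixed = TRUE), logical(1))

# Keep only the patterns that matched.
matched <- names(found)[found]
print(matched)
```

Drop fixed = TRUE if the patterns are meant to be regular expressions rather than literal text.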

If you have multiple strings, such as a = c('hello','hola','plouf'), you can do:

lapply(a, function(u) pattern[str_detect(u, pattern)])
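For a large set of strings and a much smaller set of patterns, it may be cheaper to loop over the patterns and let grepl work on the whole string vector at once. The following is a rough base R sketch under that assumption; it also assumes literal substrings (fixed = TRUE), and hits and matched are illustrative names.

```r
a <- c("hello", "hola", "plouf")
pattern <- c("ll", "lo", "hl")

# One vectorised grepl() call per pattern.
# cbind() gives a logical matrix: one row per string, one column per pattern.
hits <- do.call(cbind, lapply(pattern, function(p) grepl(p, a, fixed = TRUE)))
dimnames(hits) <- list(a, pattern)

# For each string, collect the patterns whose column is TRUE.
matched <- lapply(seq_along(a), function(i) pattern[hits[i, ]])
names(matched) <- a
print(matched)
```

The logical matrix hits can also be used directly, for example colSums(hits) counts how many strings each pattern matched.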

You can also use base R with a lookahead expression, (?=), since the patterns overlap. With gregexpr you can extract the match location for each grouped pattern as a matrix.

## changed your string so the second pattern matches twice
a <- "Hellolo"
pattern <- c("ll", "lo", "hl")
pattern1 <- sprintf("(?=(%s))", paste(pattern, collapse=")|(")) #  "(?=(ll)|(lo)|(hl))"

attr(gregexpr(pattern1, a, perl=TRUE)[[1]], "capture.start")
# [1,] 3 0 0
# [2,] 0 4 0
# [3,] 0 6 0

Each column of the matrix corresponds to one pattern and each row to one match, so pattern 2 matched at positions 4 and 6 in the test string, pattern 1 matched at position 3, and so on. A 0 in the output above marks a pattern that was not the one matched in that row.
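To turn that matrix into the set of patterns that matched, one possible follow-up is to keep the positive entries of each column. This is a sketch rather than part of the original answer; positions and matched are illustrative names, and filtering on values greater than 0 is a deliberate choice so that non-matching entries are skipped either way.

```r
a <- "Hellolo"
pattern <- c("ll", "lo", "hl")
pattern1 <- sprintf("(?=(%s))", paste(pattern, collapse = ")|("))

m <- gregexpr(pattern1, a, perl = TRUE)[[1]]
starts <- attr(m, "capture.start")

# Column j holds the start positions for pattern j; keep only real positions (> 0).
positions <- lapply(seq_along(pattern), function(j) starts[starts[, j] > 0, j])
names(positions) <- pattern

# Patterns with at least one recorded position are the ones that matched.
matched <- pattern[lengths(positions) > 0]
print(positions)
print(matched)
```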

