regular expression with an array of patterns (in R)

I'd like to identify all elements of a character vector that match an array of patterns. How do I do this? I'd like to avoid clunky for-loops, because I want the result to be invariant to the order in which I specify the patterns.

Here is a simple (non-working) example.

regex = c('a','b')
words = c('goat','sheep','banana','aardvark','cow','bird')
grepl(regex,words)
[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE
Warning message:
In grepl(regex, words) :
  argument 'pattern' has length > 1 and only the first element will be used

EDIT: Sorry, I realized that I've seen the answer to this before and had just forgotten it. It would be grepl('(a)|(b)', words), but I'd need some way of coercing the array into that form.

Use sapply:

> sapply(regex, grepl, words)
         a     b
[1,]  TRUE FALSE
[2,] FALSE FALSE
[3,]  TRUE  TRUE
[4,]  TRUE FALSE
[5,] FALSE FALSE
[6,] FALSE  TRUE

The original question suggested that the above was what was wanted, but it was then changed to ask for those elements which contain any element of regex. In that case:

> grepl(paste(regex, collapse = "|"), words)
[1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE
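
As a minimal sketch of why this satisfies the order-invariance requirement, the snippet below builds the alternation from the patterns in both orders and compares the results. The inputs are copied from the question; the names pattern, any_match, any_match_rev and matched_words are only illustrative. It assumes the individual patterns are simple enough to be joined with "|" without extra grouping.

```r
# Inputs copied from the question.
regex <- c("a", "b")
words <- c("goat", "sheep", "banana", "aardvark", "cow", "bird")

# Join the patterns into a single alternation: "a|b".
pattern <- paste(regex, collapse = "|")
any_match <- grepl(pattern, words)

# Reversing the patterns gives "b|a", which flags the same elements.
any_match_rev <- grepl(paste(rev(regex), collapse = "|"), words)
identical(any_match, any_match_rev)

# Logical indexing returns the matching words themselves.
matched_words <- words[any_match]
matched_words
```

Because alternation only asks whether at least one branch matches, the order in which the patterns are pasted together does not affect which elements are flagged.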

You could do it in the regular expression itself with a look-ahead. Here's an example of stitching the regular expression together from your search terms. Requiring both a AND b should match only banana. Make sure to set perl = TRUE to enable the (?=...) look-ahead syntax in your regexp. It should work for more complicated patterns as well; any tutorial on look-ahead assertions covers the details.

search <- c('a','b')
words <- c('goat','sheep','banana','aardvark','cow','bird')
regex <- paste(paste0("(?=.*", search, ")"), collapse = "")
matches <- grepl(regex, words, perl = TRUE)
print(data.frame(words, matches))

UPDATE: This answers the original question of matching ALL search terms. Matching ANY search term can be achieved as indicated in the edit to the original question.
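
For comparison, the ALL case can also be sketched without a look-ahead by testing each search term separately and combining the results with an elementwise AND. This is an alternative to the approach above, not part of it; the names per_term and all_match are only illustrative.

```r
# Inputs copied from the example above.
search <- c("a", "b")
words <- c("goat", "sheep", "banana", "aardvark", "cow", "bird")

# One logical vector per search term; each term is passed to grepl as the pattern.
per_term <- lapply(search, grepl, x = words)

# Elementwise AND across all terms: TRUE only where every term matched.
all_match <- Reduce(`&`, per_term)

words[all_match]
```

This version does not need perl = TRUE, and like the look-ahead pattern it gives the same result regardless of the order of the search terms.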

Some time back, I wrote a function called needleInHaystack that can be used as follows:

x <- needleInHaystack(regex, words)
x
#          a b
# goat     1 0
# sheep    0 0
# banana   1 1
# aardvark 1 0
# cow      0 0
# bird     0 1

Depending on whether you want all or any, it's easy to use apply (or rowSums).

apply(x, 1, function(x) any(as.logical(x)))
#     goat    sheep   banana aardvark      cow     bird 
#     TRUE    FALSE     TRUE     TRUE    FALSE     TRUE 
apply(x, 1, function(x) all(as.logical(x)))
#     goat    sheep   banana aardvark      cow     bird 
#    FALSE    FALSE     TRUE    FALSE    FALSE    FALSE 

It's designed for finding things even out of order. So, for example, "to" would match "goat". Not sure if that's a behavior you would want for your problem though.
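
The rowSums alternative mentioned above can be sketched as follows. Since the source of needleInHaystack is not shown here, this sketch substitutes a logical matrix built with sapply and grepl, so it performs ordinary regex matching and does not reproduce the out-of-order behavior. The names m, any_match and all_match are only illustrative.

```r
# Inputs copied from the question.
regex <- c("a", "b")
words <- c("goat", "sheep", "banana", "aardvark", "cow", "bird")

# Logical matrix: one row per word, one column per pattern.
# This stands in for the needleInHaystack output, which is an assumption.
m <- sapply(regex, grepl, x = words)
rownames(m) <- words

# A word matches ANY pattern if its row has at least one TRUE.
any_match <- rowSums(m) > 0

# A word matches ALL patterns if every column in its row is TRUE.
all_match <- rowSums(m) == ncol(m)

any_match
all_match
```

rowSums is vectorized over the whole matrix, so it avoids calling an R function once per row as apply does.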
