
Using variable to create regular expression pattern in R

I have a function:

ncount <- function(num = NULL) {

 toRead <- readLines("abc.txt")
 n <- as.character(num)
 x <- grep("{"n"} number",toRead,value=TRUE)

}

When grepping, I want the num passed to the function to be used to build the search pattern dynamically. How can this be done in R? Every line of the text file contains a number and some text.

You can use paste to concatenate strings:

grep(paste("{", n, "} number", sep = ""),homicides,value=TRUE)

To build a regular expression from variables in R, in the current scenario you can simply concatenate string literals with your variable using paste0:

grep(paste0('\\{', n, '} number'), homicides, value=TRUE)

Note that { is a special character outside a [...] bracket expression (also called a character class) and must be escaped if you need to match a literal { character.
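Applied to the function in the question, a minimal sketch might look like the following. It assumes the lines have already been read into a character vector; the names count_lines and lines, and the sample data, are illustrative and not from the original post. Both braces are escaped here to be on the safe side.

```r
# Sketch: build the pattern from `num`, escaping the literal braces.
# `count_lines` and `lines` are hypothetical names used for illustration.
count_lines <- function(lines, num) {
  pattern <- paste0("\\{", num, "\\} number")
  grep(pattern, lines, value = TRUE)
}

# Sample data (assumed); in the question these would come from readLines("abc.txt").
lines <- c("{1} number one", "{2} number two", "{12} number twelve")
count_lines(lines, 1)
# [1] "{1} number one"
```

Because the closing brace is part of the pattern, {12} is not returned when searching for {1}.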

If you need to match any one of a list of items (an alternation), you can combine paste and paste0:

words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')

The resulting regex, written as Ben likes (bananas|mangoes|plums)\\. in R string syntax, will match Ben likes bananas., Ben likes mangoes. or Ben likes plums.
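As a quick check, the sketch below prints the pattern the regex engine actually receives and tests it against two sample sentences. The sentences vector is an assumption made for illustration.

```r
words <- c("bananas", "mangoes", "plums")
regex <- paste0("Ben likes (", paste(words, collapse = "|"), ")\\.")

# cat() shows the pattern after R string unescaping: a single backslash before the dot.
cat(regex, "\n")
# Ben likes (bananas|mangoes|plums)\.

# Hypothetical test sentences.
sentences <- c("Ben likes plums.", "Ben likes pears.")
grepl(regex, sentences)
# [1]  TRUE FALSE
```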

NOTE: PCRE (used when you pass perl=TRUE to base R regex functions) and ICU (used by the stringr / stringi regex functions) have proved to handle these scenarios better, so it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.

Often you will want to build a pattern from a list of words that should be matched exactly, as whole words. The right approach depends on the type of boundaries you need and on whether the words can contain special regex metacharacters or whitespace.

In the most general case, word boundaries ( \\b ) work well.

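# Sample input (assumed for illustration; the original snippet did not define `examples`)
examples <- c('Ben likes bananas.', 'Ben likes mangoes.', 'Ben likes plums.', 'Ben likes banana.')
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"  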

The \\b(bananas|mangoes|plums)\\b pattern will match bananas, but won't match banana.
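The word boundaries also stop a listed word from matching inside a longer word. A small sketch, with input strings that are assumptions chosen for illustration:

```r
words <- c("bananas", "mangoes", "plums")
regex <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# Hypothetical inputs: an exact word, a shorter word, and a longer word.
grepl(regex, c("two bananas", "one banana", "bananasplit"), perl = TRUE)
# [1]  TRUE FALSE FALSE
```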

If your list is like

words <- c('cm+km', 'uname\\vname')

you will have to escape the words first, i.e. put a backslash before each metacharacter (written as \\ in an R string literal):

regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname 
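To see what the escaping step does on its own, the sketch below prints the escaped strings and compares matching with and without escaping. The inputs are assumptions chosen for illustration.

```r
# Same helper as above: prefix every regex metacharacter with a backslash.
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Illustrative inputs (assumed, not from the original post).
escaped <- regex.escape(c("cm+km", "a.b"))
cat(escaped, sep = "\n")
# cm\+km
# a\.b

# Unescaped, "+" is a quantifier, so the pattern does not match the literal text.
grepl("cm+km", "cm+km")                # FALSE
grepl(regex.escape("cm+km"), "cm+km")  # TRUE
```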

If your words can start or end with a special regex metacharacter, \\b word boundaries won't work. Use

  • Unambiguous word boundaries , (?<!\\w) / (?!\\w) , when the match is expected between non-word chars or start/end of string
  • Whitespace boundaries , (?<!\\S) / (?!\\S) , when the match is expected to be enclosed with whitespace chars, or start/end of string
  • Build your own using the lookbehind/lookahead combination and your custom character class / bracket expression, or even more sophisticated patterns.

Example of the first two approaches in R (replacing with the match enclosed with << and >> ):

regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."
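The third approach, custom boundaries, can be sketched in the same way. The example below is only one possible choice: it assumes that a match must not be adjacent to a word character or a + sign, and the sample text is made up for illustration.

```r
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Hypothetical input and word list.
text  <- "C++, C+++ and xC++"
words <- c("C++")

# Custom boundaries: no word character and no "+" directly before or after the match.
regex <- paste0("(?<![\\w+])(", paste(regex.escape(words), collapse = "|"), ")(?![\\w+])")
result <- gsub(regex, "<<\\1>>", text, perl = TRUE)
result
# [1] "<<C++>>, C+++ and xC++"
```

Lookarounds require perl=TRUE in base R, since the default TRE engine does not support them.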
