Format string to search engine style in R/Shiny

I am working on a seemingly simple problem that nevertheless turns out to be an annoying regex exercise.

I am designing a Shiny app that lets users search a database for strings and count the number of string matches.

From the stringr package, my ultimate call is:

str_count(text, pattern = regex(user_input))

My goal is to transform the user input into an appropriate regex while still letting the user type it in standard search-term format.

So the following user input:

artist picasso "picasso painting" france

should be turned into the following regex:

artist|picasso|picasso painting|france

The solution should treat "picasso painting" as a single term because of the quotes.

Any help is appreciated!

Here is a base R solution:

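# Escape characters that a regex engine would otherwise treat as special
# (the dot is included so that "a.b" only matches a literal dot)
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}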

sort.by.length.desc <- function (v) v[order( -nchar(v)) ] 

s <- "artist picasso \"picasso (painting)\" france zoo"
keys <- c(t(read.table(text=s, header=FALSE)))          # Read in the values
keys <- sort.by.length.desc(keys)                       # Sort the values
pattern = paste(regex.escape(keys), collapse="|")       # Create the pattern
## Test
## cat(pattern, sep="\n")                               # This shows the regex pattern
txt <- "The artist was born in france and named picasso picasso (painting)"
length(unlist(gregexpr(pattern, txt)))                  # Count the number of occurrences
[1] 4

See the R demo. There are 4 matches, so the output is 4.
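
One caveat with the counting line: gregexpr returns -1 when nothing matches, so taking the length of its result reports 1 rather than 0. A minimal sketch of a count that handles that case follows; count.matches is a hypothetical helper name, not part of the solution above.

```r
# Hypothetical helper: count regex matches in a single string,
# returning 0 when there is no match at all.
count.matches <- function(pattern, txt) {
  m <- gregexpr(pattern, txt)[[1]]   # match start positions, or -1 if none
  sum(m > 0)
}

count.matches("artist|france", "The artist was born in france")   # 2
count.matches("zoo", "The artist was born in france")             # 0
length(unlist(gregexpr("zoo", "The artist was born in france")))  # 1 (misleading)
```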

Details:

  • The regex.escape function escapes the most important characters that a regex engine may interpret as special.
  • The sort.by.length.desc function orders the items of the character vector by their number of characters, longest first.
  • The c(t(read.table(text=s, header=FALSE))) call reads the user input, keeping quoted phrases together, and stores it as a character vector in keys.
  • The pattern = paste(regex.escape(keys), collapse="|") line creates a pattern with alternation operators. It looks like picasso \(painting\)|picasso|artist|france|zoo , and cat(pattern, sep="\n") displays the resulting pattern as a literal string.
  • The length(unlist(gregexpr(pattern, txt))) line counts the matches using the base R gregexpr function.
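
The sorting step matters because backtracking engines such as PCRE (perl = TRUE in base R) and ICU (used by stringr) try alternatives from left to right and take the first one that succeeds at a given position. A small illustration with a made-up sample text is below; the variable names are only for this sketch, and R's default engine may resolve alternations differently.

```r
txt <- "a picasso painting by picasso"

# Shorter alternative first: "picasso" wins, so the phrase is never matched whole.
short.first <- regmatches(txt, gregexpr("picasso|picasso painting", txt, perl = TRUE))[[1]]
short.first   # "picasso" "picasso"

# Longer alternative first: the quoted phrase is matched as one unit.
long.first <- regmatches(txt, gregexpr("picasso painting|picasso", txt, perl = TRUE))[[1]]
long.first    # "picasso painting" "picasso"
```

Both versions return two matches here, but only the second one matches the phrase as a whole.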

Split the input by doing a global match with "[^"]*"|\S+ .
Blindly remove the leading/trailing double quotes from each match with ^"|"$ .
Push the matches into an array.
Sort the array so the longest items come first (descending length).
Escape each element's metacharacters by replacing ([\[$^()*+|{}\\-]) with \\$1 .
Finally, join the elements together with the alternation operator | .
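
The steps above can be sketched in base R roughly as follows. The function name build.pattern is hypothetical, and the escape class is an assumption borrowed from the regex.escape idea (with the dot added) rather than the exact class listed above.

```r
# Hypothetical helper: turn search-engine style input into an alternation regex.
build.pattern <- function(input) {
  # 1. Global match: either a double-quoted phrase or a run of non-whitespace
  tokens <- regmatches(input, gregexpr('"[^"]*"|\\S+', input, perl = TRUE))[[1]]
  # 2. Strip the leading/trailing double quotes from each token
  tokens <- gsub('^"|"$', "", tokens)
  # Drop tokens that are empty after stripping (e.g. the input "")
  tokens <- tokens[nzchar(tokens)]
  # 3. Longest tokens first, so phrases win over their own prefixes
  tokens <- tokens[order(-nchar(tokens))]
  # 4. Escape regex metacharacters in each token
  tokens <- gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", tokens)
  # 5. Join with the alternation operator
  paste(tokens, collapse = "|")
}

build.pattern('artist picasso "picasso painting" france')
# "picasso painting|picasso|artist|france"
```

The result can then be passed to a counting call such as str_count or gregexpr.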
