Formatting a string for search-engine-style search in R / Shiny

Question

I am working on a seemingly simple problem that nevertheless turns out to be an annoying regex exercise.

I am designing a Shiny app that lets users search a database for strings and count the number of matches.

From the stringr package, my final call is:

str_count(text, pattern = regex(user_input))

My goal is to transform the user input into an appropriate regex, while still letting the user type the query in standard search-term format.

So the following user input:

artist picasso "picasso painting" france

should be turned into the following regex:

artist|picasso|picasso painting|france

The solution should treat "picasso painting" as a single term because of the quotes.

Any help is appreciated!

Answer 1

Here is a base R solution:

regex.escape <- function(string) {
  # Prefix each regex metacharacter (including the dot) with a backslash
  gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}

sort.by.length.desc <- function(v) v[order(-nchar(v))]

s <- "artist picasso \"picasso (painting)\" france zoo"
keys <- c(t(read.table(text = s, header = FALSE)))      # Read in the values (a quoted phrase stays one item)
keys <- sort.by.length.desc(keys)                       # Sort the values, longest first
pattern <- paste(regex.escape(keys), collapse = "|")    # Create the pattern
## Test
## cat(pattern, sep = "\n")                             # This shows the regex pattern
txt <- "The artist was born in france and named picasso picasso (painting)"
length(unlist(gregexpr(pattern, txt)))                  # Count the number of occurrences
## [1] 4

There are 4 matches, so the output is 4.
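
One caveat about the counting line: gregexpr returns -1 when nothing matches, so length(unlist(gregexpr(pattern, txt))) reports 1 rather than 0 for a text with no hits. Below is a minimal sketch of a guard. The name count_matches is hypothetical and not part of the answer above, and the pattern is written out by hand so the snippet runs on its own.

```r
# Hypothetical helper (an assumption, not from the original answer):
# count regex matches in a single string, returning 0 when nothing matches.
count_matches <- function(pattern, txt) {
  m <- gregexpr(pattern, txt)[[1]]   # match start positions, or -1 if there is no match
  sum(m > 0)                         # count only real positions
}

pattern <- "picasso \\(painting\\)|picasso|artist|france|zoo"
txt <- "The artist was born in france and named picasso picasso (painting)"

count_matches(pattern, txt)                           # same count as the length() approach
count_matches(pattern, "no hits here")                # 0
length(unlist(gregexpr(pattern, "no hits here")))     # 1, because of the -1 sentinel
```

In a Shiny app the no-match case is common (a user can type any term), so a guard of this kind is probably worth having.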

Details:

The regex.escape function escapes the most important characters that a regex engine may interpret as special.
The sort.by.length.desc function orders the items of the character vector by their number of characters, in descending order, so that longer terms come before shorter ones.
The c(t(read.table(text=s, header=FALSE))) call reads the user input and stores it as a character vector in keys.
The pattern = paste(regex.escape(keys), collapse="|") line creates a pattern with alternation operators. It looks like picasso \(painting\)|picasso|artist|france|zoo, and cat(pattern, sep="\n") displays the resulting pattern as a literal string.
The length(unlist(gregexpr(pattern, txt))) line counts the occurrences of a match using the base R gregexpr function.
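
To see why the sort step matters, here is a small sketch. It assumes the PCRE engine (perl = TRUE), which tries alternatives from left to right; the default engine is not exercised here. The variable names and sample text are made up for illustration.

```r
txt <- "picasso painting"
keys <- c("picasso", "painting", "picasso painting")

unsorted <- paste(keys, collapse = "|")                       # "picasso|painting|picasso painting"
sorted   <- paste(keys[order(-nchar(keys))], collapse = "|")  # "picasso painting|painting|picasso"

# First match found by each pattern (PCRE takes the first alternative that fits)
regmatches(txt, regexpr(unsorted, txt, perl = TRUE))   # "picasso"
regmatches(txt, regexpr(sorted,   txt, perl = TRUE))   # "picasso painting"

# The number of matches changes too
n_unsorted <- sum(gregexpr(unsorted, txt, perl = TRUE)[[1]] > 0)   # 2: "picasso", then "painting"
n_sorted   <- sum(gregexpr(sorted,   txt, perl = TRUE)[[1]] > 0)   # 1: the whole phrase
```

With the longest term first, the quoted phrase is counted once as a phrase instead of being split into its individual words.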

Answer 2

Split the input into tokens with a global match on "[^"]*"|\S+ .
Blindly remove leading/trailing double quotes with ^"|"$ .
Push the matches into an array.
Sort the array with the longest element first (descending order).
Replace each element's metacharacters ([\[$^()*+|{}\\-]) with \\$1 .
Finally, join the elements together with the alternation operator | .
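
The steps above can be sketched in base R roughly as follows. This is one possible reading of the answer, not the answerer's own code: the function name search_terms_to_regex is hypothetical, the escape class is a slightly broader one chosen for this sketch, and perl = TRUE is an assumption so that \S and the character class are interpreted by PCRE.

```r
# Hypothetical helper (name and details are assumptions for illustration).
search_terms_to_regex <- function(input) {
  # 1. Tokenize: a double-quoted phrase, or a run of non-whitespace characters
  tokens <- regmatches(input, gregexpr('"[^"]*"|\\S+', input, perl = TRUE))[[1]]
  # 2. Strip a leading and a trailing double quote from each token
  tokens <- gsub('^"|"$', "", tokens)
  # Drop tokens that are now empty (for example from an empty pair of quotes)
  tokens <- tokens[nzchar(tokens)]
  # 3. Longest terms first
  tokens <- tokens[order(-nchar(tokens))]
  # 4. Escape regex metacharacters by prefixing them with a backslash
  tokens <- gsub("([][\\\\^$.|?*+(){}])", "\\\\\\1", tokens, perl = TRUE)
  # 5. Join with the alternation operator
  paste(tokens, collapse = "|")
}

search_terms_to_regex('artist picasso "picasso painting" france')
# "picasso painting|picasso|artist|france"
```

Note that the terms come out longest first rather than in the order typed in the question; as a set of alternatives it covers the same terms. An empty input produces the empty pattern "", which a caller would probably want to check for before searching.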

Answer 1: score 2, accepted, 2017-07-21 20:28:50
Answer 2: score 0
