Format string to search engine style in R/Shiny
I am working on a seemingly simple problem that nevertheless turns out to be an annoying regex exercise.
I am designing a Shiny app that allows users to search a database for strings and count the number of string matches.
Using the stringr package, my final call is:
str_count(text, pattern=REGEX(user_input))
My goal is to transform the user input into an appropriate regex, while allowing the user to enter the search terms in standard search engine format.
So the following user input:
artist picasso "picasso painting" france
should be turned into the following regex:
artist|picasso|picasso painting|france
where the solution knows to treat "picasso painting" as a single term because of the quotes.
Any help is appreciated!
Here is a base R solution:
regex.escape <- function(string) {
  # Prefix each regex metacharacter (including the dot) with a backslash
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
sort.by.length.desc <- function(v) v[order(-nchar(v))]
s <- "artist picasso \"picasso (painting)\" france zoo"
keys <- c(t(read.table(text=s, header=FALSE))) # Read in the values
keys <- sort.by.length.desc(keys) # Sort the values
pattern = paste(regex.escape(keys), collapse="|") # Create the pattern
## Test
## cat(pattern, sep="\n") # This shows the regex pattern
txt <- "The artist was born in france and named picasso picasso (painting)"
length(unlist(gregexpr(pattern, txt))) # Count the number of occurrences
[1] 4
See the R demo. There are 4 matches, so the output is 4.
Details:
regex.escape escapes the most important characters that a regex engine may interpret as special.
sort.by.length.desc orders the items of a character vector by their number of characters, in descending order, so that a longer phrase such as picasso (painting) comes before its prefix picasso in the alternation.
c(t(read.table(text=s, header=FALSE))) reads the user input and stores it as a character vector in keys; read.table keeps a double-quoted phrase together as a single value.
pattern = paste(regex.escape(keys), collapse="|") creates a pattern with alternation operators. It looks like picasso \(painting\)|picasso|artist|france|zoo, and cat(pattern, sep="\n") displays the resulting pattern as a literal string.
length(unlist(gregexpr(pattern, txt))) counts the occurrences of a match using the base R gregexpr function.
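One caveat about the counting line: when nothing matches, gregexpr returns -1, so length(unlist(...)) still reports 1. Below is a minimal sketch of a count that handles this case, assuming a single input string. The name count.matches is a hypothetical helper, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): count regex matches
# in a single string, returning 0 when there is no match at all.
count.matches <- function(pattern, txt) {
  m <- gregexpr(pattern, txt)[[1]]  # match start positions, or -1 if none
  sum(m > 0)
}

pattern <- "picasso \\(painting\\)|picasso|artist|france|zoo"
txt <- "The artist was born in france and named picasso picasso (painting)"

count.matches(pattern, txt)                 # 4
count.matches(pattern, "nothing relevant")  # 0

# The original one-liner reports 1 here, because gregexpr returns -1
length(unlist(gregexpr(pattern, "nothing relevant")))
```

For the Shiny use case this matters whenever a search term does not occur in a given text.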
Do a global match with "[^"]*"|\S+ to split the input into tokens.
Blindly remove leading/trailing double quotes by replacing ^"|"$ with nothing.
Push the matches into an array.
Sort the array with the longest element first (descending by length).
Replace each element's metachars, matched by ([\[$^()*+|{}\\-]), with \\$1
Finally, join the elements together with the alternation operator |
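The steps above can be sketched in base R roughly as follows. This is an illustration under stated assumptions, not the answerer's own code: build.pattern and escape.regex are hypothetical names, the set of escaped characters is an assumption, and perl = TRUE is used so that the first alternative (the quoted phrase) is tried before \S+.

```r
# Hypothetical helper: backslash-escape common regex metacharacters.
# The character set is an assumption; extend it if your input needs more.
escape.regex <- function(x) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", x)
}

# Hypothetical helper: turn a search-engine style query into an alternation.
build.pattern <- function(input) {
  # Global match: a double-quoted phrase, or a run of non-whitespace
  hits <- gregexpr('"[^"]*"|\\S+', input, perl = TRUE)
  tokens <- regmatches(input, hits)[[1]]
  # Strip one leading and one trailing double quote from each token
  tokens <- gsub('^"|"$', "", tokens)
  # Drop tokens that became empty (e.g. from an empty pair of quotes)
  tokens <- tokens[nzchar(tokens)]
  # Longest token first, so a phrase precedes its own prefix
  tokens <- tokens[order(-nchar(tokens))]
  # Escape metacharacters and join with the alternation operator
  paste(escape.regex(tokens), collapse = "|")
}

build.pattern('artist picasso "picasso painting" france')
# "picasso painting|picasso|artist|france"
```

The resulting string can then be passed as the pattern argument of str_count from the question, or to gregexpr as in the first answer.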