
Convert SphinxSearch query syntax to boolean search string in Ruby

I've been pondering the easiest way to convert the following Sphinx Search query into the form commonly used in typical web searches or portals, i.e. a boolean search string, and vice versa:

(A | B) "C D" (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R

It needs to be converted to:

(A OR B) AND "C D" AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R

Also, a slight variation for example purposes:

(A | B) C D (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R

should become:

(A OR B) AND C AND D AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R

For clarity, "A" can be any word in any case; matching is not case sensitive. In the starting syntax, spaces denote AND unless they are inside quotes, so AB would simply be one word, e.g. Java. The spacing around the pipe isn't important: (A|B) is the same as ( A | B ) or (A | B). Each letter denotes a word.

Some of these queries will be quite long, up to 500 terms. Although that isn't a huge overhead to process, I'm wondering what the best (most efficient) way to convert them would be: tokenization, regex/pattern matching, simple replace, recursion, etc. What would you recommend?

Readers are perhaps looking for an elegant, or at least not hackish, solution to this problem. That was my objective as well but, alas, this is the best I've been able to come up with.

Code

def convert(str)
  subs = []
  str.gsub(/"[^"]*"| *\| */) do |s|
    if s.match?(/ *\| */)
      '|'
    else
      subs << s
      '*'
    end
  end.gsub(/ +/, ' AND ').
      gsub(/[*|]/) { |s| s == '|' ? ' OR ' : subs.shift }
end

Examples

puts convert(%Q{(A | B) "C D" (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R})
  #-> (A OR B) AND "C D" AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R
puts convert(%Q{(A|B)   C D (E| "F G" |"H I J") ("K L"   ("M N" | "O P")) Q R})
  #-> (A OR B) AND C AND D AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R

Notice that in this example there is no space before and/or after some pipes, and in some places outside double-quoted strings there are multiple spaces.

puts convert(%Q{(Ant | Bat) Cat Dod (Emu | "Frog Gorilla" | "Hen Ibex Jackel") ("Kwala Lynx" ("Magpie N" | "Ocelot Penguin")) Quail Rabbit})
  #-> (Ant OR Bat) AND Cat AND Dod AND (Emu OR "Frog Gorilla" OR "Hen Ibex Jackel") AND ("Kwala Lynx" AND ("Magpie N" OR "Ocelot Penguin")) AND Quail AND Rabbit

Here I've replaced the capital letters with words.

Explanation

To see how this works, let

str = %Q{(A | B) "C D" (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R}
  #=> "(A | B) \"C D\" (E | \"F G\" | \"H I J\") (\"K L\" (\"M N\" | \"O P\")) Q R"

then

subs = []
s1 = str.gsub(/"[^"]*"| *\| */) do |s|
  if s.match?(/ *\| */)
    '|'
  else
    subs << s
    '*'
  end
end
  #=> "(A|B) * (E|*|*) (* (*|*)) Q R"
subs
  #=> ["\"C D\"", "\"F G\"", "\"H I J\"", "\"K L\"", "\"M N\"", "\"O P\""]

As you can see, I have removed the spaces around pipes and replaced all quoted strings with asterisks, saving those strings in the array subs so that I can later replace the asterisks with their original values. The choice of an asterisk is of course arbitrary.

The regular expression reads, "match a double-quoted string of zero or more characters, or a pipe ('|') optionally preceded and/or followed by spaces".
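To check that reading, it can help to list what the pattern actually matches. The following is only a small sketch: the helper name `quoted_or_pipe_matches` and the sample string are made up for illustration and are not part of the solution above.

```ruby
# Hypothetical helper (not part of the answer's code): return every
# substring matched by the same pattern that convert passes to gsub.
def quoted_or_pipe_matches(str)
  str.scan(/"[^"]*"| *\| */)
end

# Pipes are captured together with any spaces around them; quoted
# strings are captured whole; the lone space before "C D" is not matched.
p quoted_or_pipe_matches(%q{(A| "F G" |B) "C D"})
  #=> ["| ", "\"F G\"", " |", "\"C D\""]
```

Because the pattern has no capture groups, `String#scan` returns a flat array of the matched substrings.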

After these substitutions, every remaining run of spaces is to be replaced by ' AND ':

s2 = s1.gsub(/ +/, ' AND ')
  #=> "(A|B) AND * AND (E|*|*) AND (* AND (*|*)) AND Q AND R"

It remains to replace each '|' with ' OR ' and each asterisk with its original value:

s2.gsub(/[*|]/) { |s| s == '|' ? ' OR ' : subs.shift }
  #=> "(A OR B) AND \"C D\" AND (E OR \"F G\" OR \"H I J\") AND (\"K L\" AND (\"M N\" OR \"O P\")) AND Q AND R"
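The question also mentions tokenization and the reverse conversion, so here is a rough sketch of that alternative. It is not the method used above: the names `tokenize`, `sphinx_to_boolean` and `boolean_to_sphinx` are invented for this example, and it assumes quotes are balanced and that only upper-case AND and OR act as operators in the boolean form. Working on tokens avoids the placeholder step, so a literal `*` in a term is not mistaken for a stashed quoted string.

```ruby
# Split a query into quoted phrases, parentheses, pipes and bare words.
# Assumption: quotes are balanced; an unmatched quote is silently dropped.
def tokenize(str)
  str.scan(/"[^"]*"|[()|]|[^\s()|"]+/)
end

# Sphinx syntax -> boolean string.
def sphinx_to_boolean(str)
  out = String.new
  prev = nil
  tokenize(str).each do |tok|
    if tok == '|'
      out << ' OR '
    else
      # AND is implied wherever a term or ')' is followed by a term or '('.
      out << ' AND ' if prev && prev != '(' && prev != '|' && tok != ')'
      out << tok
    end
    prev = tok
  end
  out
end

# Boolean string -> Sphinx syntax.
# Assumption: only upper-case AND / OR are operators; "and" is a plain term.
def boolean_to_sphinx(str)
  out = String.new
  prev = nil
  tokenize(str).each do |tok|
    next if tok == 'AND'           # AND becomes a plain space
    tok = '|' if tok == 'OR'
    out << ' ' if prev && prev != '(' && tok != ')'
    out << tok
    prev = tok
  end
  out
end

sphinx = %q{(A | B) "C D" (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R}
puts sphinx_to_boolean(sphinx)
puts boolean_to_sphinx(sphinx_to_boolean(sphinx))
```

Each direction is a single pass over the tokens, so queries of a few hundred terms should not be a concern. Neither method validates the query: unbalanced parentheses are passed through unchanged.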
