简体   繁体   中英

Convert SphinxSearch query syntax to boolean search string in Ruby

I've been pondering over what is the easiest way to convert the following Sphinx Search query into what is commonly used in typical web searches or portals, eg a boolean search string, and vice versa

(A | B) "C D" (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R

Needs to be converted to

(A OR B) AND "C D" AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R

also slight variation for example purposes

(A | B) C D (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R

should be

(A OR B) AND C AND D AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R

For clarity, "A" can be any word and any case, its not case sensitive. Spaces denote AND in the starting syntax unless inside quotes. So AB would simply be one word eg Java. The space between (A|B) isnt important (A|B) is the same as ( A | B ) or (A | B) etc. Each letter denotes a word.

Some of these queries will be quite long - upto 500 terms. Although this isn't a huge overhead to process, I'm thinking what would be the BEST (most efficient) way to convert this. Tokenization, Regex/pattern matching, simple replace, recursion etc. What would any of you recommend?

Readers are perhaps looking for an elegant, at least not hackish, solution to this problem. That was my objective as well, but, alas, this is the best I've been able to come up with.

Code

def convert(str)
  subs = []
  str.gsub(/"[^"]*"| *\| */) do |s|
    if s.match?(/ *\| */)
      '|'
    else
      subs << s
      '*'
    end
  end.gsub(/ +/, ' AND ').
      gsub(/[*|]/) { |s| s == '|' ? ' OR ' : subs.shift }
end

Examples

puts convert(%Q{(A | B) "C D" (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R})
  #-> (A OR B) AND "C D" AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R
puts convert(%Q{(A|B)   C D (E| "F G" |"H I J") ("K L"   ("M N" | "O P")) Q R})
  #-> (A OR B) AND C AND D AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R

Notice that in this example there is no space before and/or after some pipes and in some places outside double-quoted strings there are multiple spaces.

puts convert(%Q{(Ant | Bat) Cat Dod (Emu | "Frog Gorilla" | "Hen Ibex Jackel") ("Kwala Lynx" ("Magpie N" | "Ocelot Penguin")) Quail Rabbit})
  #-> (Ant OR Bat) AND Cat AND Dod AND (Emu OR "Frog Gorilla" OR "Hen Ibex Jackel") AND ("Kwala Lynx" AND ("Magpie N" OR "Ocelot Penguin")) AND Quail AND Rabbit

Here I've replaced the capital letters with words.

Explanation

To see how this works let

str = %Q{(A | B) "C D" (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R}
  #=> "(A | B) \"C D\" (E | \"F G\" | \"H I J\") (\"K L\" (\"M N\" | \"O P\")) Q R"

then

subs = []
str.gsub(/"[^"]*"| *\| */) do |s|
  if s.match?(/ *\| */)
    '|'
  else
    subs << s
    '*'
  end
end
  #=> "(A|B) * (E|*|*) (* (*|*)) Q R"
  subs
    #=> ["\"C D\"", "\"F G\"", "\"H I J\"", "\"K L\"", "\"M N\"", "\"O P\""]

As you see, I have removed the spaces around pipes and replaced all quoted strings with asterisks, saving those strings in the array subs , so that I can later replace the asterisks with their original values. The choice of an asterisk is of course arbitrary.

The regular expression reads, "match a double-quoted string of zero or more characters or a pipe ( '|' ) optionally preceded and/or followed by spaces".

As a result of these substitutions, all remaining strings of spaces are to be replaced by ' AND ' :

s2 = s1.gsub(' +', ' AND ')
  #=> "(A|B) AND * AND (E|*|*) AND (* AND (*|*)) AND Q AND R"

It remains to replace '|' with ' OR ' and each asterisk by its original value:

s2.gsub(/[*|]/) { |s| s == '|' ? ' OR ' : subs.shift }
  #=> "(A OR B) AND \"C D\" AND (E OR \"F G\" OR \"H I J\") AND (\"K L\" AND (\"M N\" OR \"O P\")) AND Q AND R"

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM