Parsing Lucene Query syntax and escaping for CloudSearch

Basically, I have an application that needs to support both Lucene.NET and Amazon CloudSearch.

So I can't rewrite the queries; I need to use the standard Lucene queries and call .ToString() on the query to get the syntax.

The issue is that in Lucene.NET (I don't know whether the Java version behaves the same way), the .ToString() method returns the raw string without the escape characters.

Therefore, things like:

(title:blah:blah summary:"lala:la")

should be

(title:blah\:blah summary:"lala\:la")

What I need is a regex that will add the escapes.

Is this possible? And if so, what would it look like?

Some additional possible variations:

(title:"this is a search:term")
(field5:"this is a title:term")

Based on the comments and edits, it seems that you want any query string to be correctly escaped by the regex, and any given Lucene query to be accurately represented by the resulting string.

That ain't gonna happen.

Lucene query syntax is not capable of expressing all Lucene queries. In fact, the string you get from Query.toString() often can't even be parsed by the QueryParser, never mind being an accurate reconstruction of the query.

The long and short of it: you are going about this the wrong way. Query.ToString() is not designed to serialize the query, and its goal is not to create a parsable query string. It's mainly for debugging and such. If you keep attempting to use it this way, this tomfoolery of trying to use a regex to escape ambiguous query syntax will likely just be the start of your troubles.

This question provides another example of this.
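To make the ambiguity concrete, here is a minimal Python sketch. It does not use Lucene at all: `render` and `render_escaped` are hypothetical stand-ins for an unescaped and an escaped "field:text" rendering, and the field name containing a colon is contrived purely for illustration.

```python
def render(field: str, text: str) -> str:
    """Mimic an unescaped 'field:text' rendering (a stand-in, not Lucene's code)."""
    return f"{field}:{text}"


def render_escaped(field: str, text: str) -> str:
    """Escape backslashes and colons in each part before joining them."""
    def esc(s: str) -> str:
        return s.replace("\\", "\\\\").replace(":", "\\:")
    return f"{esc(field)}:{esc(text)}"


# Two different (field, text) pairs collapse to the same unescaped string,
# so nothing applied to the string afterwards can tell them apart.
a = render("title", "blah:blah")
b = render("title:blah", "blah")
print(a == b)  # True

# Escaping while the parts are still separate keeps them distinguishable.
print(render_escaped("title", "blah:blah"))
print(render_escaped("title:blah", "blah"))
```

Once the parts have been joined without escaping, the information about which colon was the separator is gone, which is why a regex over the finished string can only guess.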

You can use this regex to match the colon : at strategic points in the string:

(?<!title|summary):

Then escape each matched colon by replacing it with \:

Explanation

The negative lookbehind (?<!title|summary) rejects any colon that is immediately preceded by title or summary; every other colon : is matched.

See Demo

Input

(title:blah:blah summary:"lala:la")

Output

(title:blah\:blah summary:"lala\:la")
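For reference, a minimal Python sketch of the same substitution. Python is used here only for illustration; its `re` module requires fixed-width lookbehinds, so the alternation is split into one lookbehind per field name (in .NET the pattern above can stay as written). The function name `escape_colons` and the hard-coded field list are assumptions.

```python
import re

# Assumption: "title" and "summary" are the only field names in use.
# Python's `re` rejects variable-width lookbehinds, so each field name
# gets its own negative lookbehind instead of one alternation.
FIELD_COLON = re.compile(r"(?<!title)(?<!summary):")


def escape_colons(query: str) -> str:
    """Backslash-escape every colon not directly preceded by a known field name."""
    # A function replacement avoids backslash handling in replacement templates.
    return FIELD_COLON.sub(lambda m: "\\:", query)


print(escape_colons('(title:blah:blah summary:"lala:la")'))
# prints: (title:blah\:blah summary:"lala\:la")
```

This is a heuristic rather than a parser. A field that is not in the list, such as field5 from the question's variations, has its separator colon escaped, and a value that happens to end in title or summary keeps its colon unescaped.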
