MarkLogic wildcarded search false positives issue (SQL LIKE %something% equivalent)

I have a question about the behavior of wildcarded search in MarkLogic.

Basically, I am trying to replicate a SQL LIKE '%something%' query.

Here is the code that returns false positives:

xquery version "1.0-ml";
cts:search(/,
  cts:element-query(fn:QName("", "Document"),
    cts:element-word-query(fn:QName("", "Information"), "*date*", ("wildcarded"), 0),
    ()),
  "unfiltered")

A few notes:

  • The "unfiltered" option must stay, because performance is critical.
  • I am using the Unicode collation and have enabled:

    • three character searches
    • trailing wildcard searches
    • fast element trailing wildcard searches
    • two character searches
    • one character searches
    • fast element character searches

What I don't understand is why "*something" and "something*" return correct results, while "*something*" returns false positives. How can I fix this?

Input example:

  1. <Document><Information>another updated document</Information></Document>
  2. <Document><Information>INCUMBENCY CERTIFICATE</Information></Document>
  3. <Document><Information>Certificate of Incumbency</Information></Document>
  4. <Document><Information>something 344_dated 243</Information></Document>
  5. <Document><Information>another terminated document</Information></Document>

Output:

All five documents match, although only documents 1 and 4 should be returned.
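
One way to measure the problem is to run the same query both unfiltered and filtered and compare the counts. The following is only a sketch: it reuses the query from the question, and the `counts` wrapper element and its children are names made up for illustration. A filtered search checks each candidate against the actual document content, so any difference between the two counts is the number of false positives the indexes let through.

```xquery
xquery version "1.0-ml";

(: Sketch: compare index-only (unfiltered) results with filtered results.
   The <counts> element and its children are illustrative names only. :)
let $query :=
  cts:element-query(fn:QName("", "Document"),
    cts:element-word-query(fn:QName("", "Information"), "*date*", ("wildcarded")))
(: Unfiltered: candidates come straight from the indexes. :)
let $unfiltered := fn:count(cts:search(/, $query, "unfiltered"))
(: Filtered: each candidate is verified against the document itself. :)
let $filtered := fn:count(cts:search(/, $query, "filtered"))
return
  <counts>
    <unfiltered>{$unfiltered}</unfiltered>
    <filtered>{$filtered}</filtered>
    <false-positives>{$unfiltered - $filtered}</false-positives>
  </counts>
```

Rerunning this after each index change shows whether the unfiltered count has converged on the filtered one.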

Final edit: One thing to add is that the same settings did not produce the same results on two databases, one of which holds many more documents. On the larger database, the settings that finally gave correct results were:

  • word searches
  • word positions
  • triple index
  • fast element word searches
  • element word positions
  • fast element phrase searches
  • three character searches
  • three character word positions
  • fast element character searches
  • trailing wildcard searches
  • trailing wildcard word positions
  • fast element trailing wildcard searches
  • word lexicon: codepoint collation

Unfiltered wildcard queries within specific elements (i.e., not just within a document) may return false positives without positional indexes. I would try enabling either or both of word positions and element word positions. It may also be worth testing whether you see additional performance improvements from enabling fast element phrase searches.

It's possible that "*something and something*" filters out false positives simply because it contains more terms, and not because the indexes resolve that phrase more accurately.

Update: After reviewing your updated test case, it appears that the trailing wildcard index is not accurate enough unless trailing wildcard word positions is enabled. That setting and three character word positions appear to be necessary to resolve this type of leading-and-trailing element wildcard query.
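
The position settings named in this answer can also be switched on from XQuery through the Admin API rather than the Admin Interface. This is a minimal sketch, assuming a database called "Documents" (a placeholder, not a name from the question) and a user with admin privileges. Changing index settings normally triggers a reindex when reindexing is enabled on the database, so on a large database this can take a while.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumption: "Documents" is a placeholder; substitute your database name. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()
(: Positional indexes suggested in the answer above. :)
let $config := admin:database-set-word-positions($config, $db, fn:true())
let $config := admin:database-set-element-word-positions($config, $db, fn:true())
let $config := admin:database-set-three-character-word-positions($config, $db, fn:true())
let $config := admin:database-set-trailing-wildcard-word-positions($config, $db, fn:true())
(: Nothing changes until the configuration is saved. :)
return admin:save-configuration($config)
```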

I would recommend disabling one character searches and two character searches if they are not strictly necessary, since they generate large indexes. fast element character searches and fast element trailing wildcard searches also do not appear to be required for accuracy in your case, so you might want to test whether your queries are fast enough without them.
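
Turning off the one and two character indexes could look like the sketch below. As before, "Documents" is a placeholder database name, not one taken from the question. The two fast element settings are left untouched here, since the advice above is to test query speed before deciding on them.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumption: "Documents" is a placeholder; substitute your database name. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()
(: Drop the two index types that tend to grow large. :)
let $config := admin:database-set-one-character-searches($config, $db, fn:false())
let $config := admin:database-set-two-character-searches($config, $db, fn:false())
return admin:save-configuration($config)
```

Without these indexes, wildcard patterns with fewer than three non-wildcard characters may resolve less precisely from the indexes, so it is worth rechecking any such queries after the change.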

While using cts:element-value-query, did you try the "exact" option to get exact results? Try it once and let me know how it behaves. I once faced a similar issue.
