Elasticsearch cross_fields query with an edge n-gram analyzer

Question

I have 999 documents for an Elasticsearch experiment.

My type mapping has a field f4 that is analyzed, with the following analyzer settings:

  "myNGramAnalyzer" => [
       "type" => "custom",
        "char_filter" => ["html_strip"],
        "tokenizer" => "standard",
        "filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"]
  ]

My filter is defined as follows:

  "filter" => [
        "ngram_filter" => [
            "type" => "edgeNGram",
            "min_gram" => "2",
            "max_gram" => "20"
        ]
  ]

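The values of field f4 are "Proj1", "Proj2", "Proj3", and so on.

Now, when I search for the string "proj1" with a cross_fields query, I expect the document containing "Proj1" to come back at the top of the response with the highest score. That is not what happens. The rest of the data is almost identical across documents.

I also don't understand why the query matches all 999 documents.

Here is my search: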

{
    "index": "myindex",
    "type": "mytype",
    "body": {
        "query": {
            "multi_match": {
                "query": "proj1",
                "type": "cross_fields",
                "operator": "and",
                "fields": "f*"
            }
        },
        "filter": {
            "term": {
                "deleted": "0"
            }
        }
    }
}

My search result is:

{
    "took": 12,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 999,
        "max_score": 1,
        "hits": [{
            "_index": "myindex",
            "_type": "mytype",
            "_id": "42",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "125650","f3": "BH.1511AI.001",
                "f4": "Proj42",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        }, {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "47",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "137946","f3": "BH.152096.001",
                "f4": "Proj47",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        }, 
        //.......
        //.......
        //MANY RECORDS IN BETWEEN HERE
        //.......
        //.......
        {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "1",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "142095","f3": "BH.705215.001",
                "f4": "Proj1",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        //.......
        //.......
        //MANY RECORDS IN BETWEEN HERE
        //.......
        //.......
        }]
    }
}

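Am I doing something wrong, or am I missing something? (Sorry for the lengthy question, but I thought I should provide all the information I could and leave out the unrelated code.)

Edit:

Term vector response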

{
    "_index": "myindex",
    "_type": "mytype",
    "_id": "10",
    "_version": 1,
    "found": true,
    "took": 9,
    "term_vectors": {
        "f4": {
            "field_statistics": {
                "sum_doc_freq": 5886,
                "doc_count": 999,
                "sum_ttf": 5886
            },
            "terms": {
                "pr": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "pro": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj1": {
                    "doc_freq": 111,
                    "ttf": 111,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj10": {
                    "doc_freq": 11,
                    "ttf": 11,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                }
            }
        }
    }
}

Edit 2

Mapping for field f4

"f4" : {
    "type" : "string",
    "index_analyzer" : "myNGramAnalyzer",
    "search_analyzer" : "standard"
}

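I have updated the mapping to use the standard analyzer at query time. This improved the results, but they still fall short of what I expected.

Instead of 999 (all documents), the query now returns 111 documents, such as "Proj1", "Proj11", "Proj111" ... "Proj1", "Proj181" ... and so on.

"Proj1" still appears somewhere in the middle of the results rather than at the top.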
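Both hit counts above (999, then 111) can be reproduced with a small simulation. The Python sketch below is not Elasticsearch code. It assumes the corpus is exactly Proj1 through Proj999, models only the lowercase and edge n-gram steps of myNGramAnalyzer (the stop and snowball filters are ignored, on the assumption that they leave these values unchanged), and assumes that when the query string is itself run through the edge n-gram analyzer, its grams share one position, so any one of them is enough to match. The function names are hypothetical.

```python
def edge_ngrams(text, min_gram=2, max_gram=20):
    """Lowercase the text and return its prefixes of length min_gram..max_gram."""
    token = text.lower()
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Assumed corpus: f4 holds "Proj1" .. "Proj999", one value per document.
docs = {i: "Proj%d" % i for i in range(1, 1000)}

# Inverted index: term -> set of document ids containing that term.
index = {}
for doc_id, value in docs.items():
    for term in edge_ngrams(value):
        index.setdefault(term, set()).add(doc_id)


def search_any(terms):
    """Return ids of documents containing at least one of the terms."""
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


# Query analyzed with the edge n-gram analyzer: pr, pro, proj, proj1.
ngram_hits = search_any(edge_ngrams("proj1"))
# Query analyzed with the standard analyzer: the single term proj1.
standard_hits = search_any(["proj1"])

print(len(ngram_hits))                          # 999
print(len(standard_hits))                       # 111
print(len(index["proj10"]))                     # 11
print(sum(len(ids) for ids in index.values()))  # 5886
```

The last two numbers agree with the doc_freq of proj10 and the sum_doc_freq in the term vector response. Each of the 111 documents holds the term proj1 exactly once in f4, so a match on that term alone gives the scorer little to tell Proj1 apart from Proj11 or Proj181, which may be why Proj1 does not rank first.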

Answer 1

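There is no index_analyzer (at least not as of Elasticsearch 1.7). The mapping parameters you can use are analyzer and search_analyzer. Try the following steps to get it working.

Create myindex with the analyzer settings: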

PUT /myindex
{
   "settings": {
     "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "myNGramAnalyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": "html_strip",
               "filter": [
                  "lowercase",
                  "standard",
                  "asciifolding",
                  "stop",
                  "snowball",
                  "ngram_filter"
               ]
            }
         }
      }
   }
}

Add the mapping to mytype (for brevity, I have mapped only the relevant fields):

PUT /myindex/_mapping/mytype
{
   "properties": {
      "f1": {
         "type": "string"
      },
      "f4": {
         "type": "string",
         "analyzer": "myNGramAnalyzer",
         "search_analyzer": "standard"
      },
      "deleted": {
         "type": "string"
      }
   }
}

Index some data:

PUT myindex/mytype/1
{
    "f1":"396",
    "f4":"Proj12" ,
    "deleted": "0"
}

PUT myindex/mytype/2
{
    "f1":"42",
    "f4":"Proj22" ,
    "deleted": "1"
}

Now try the query:

GET myindex/mytype/_search
{
   "query": {
      "multi_match": {
         "query": "proj1",
         "type": "cross_fields",
         "operator": "and",
         "fields": "f*"
      }
   },
   "filter": {
      "term": {
         "deleted": "0"
      }
   }
}

It should return document #1. It works for me in Sense. I am using Elasticsearch 2.x.
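Why only document #1 should come back can be checked by hand. The Python sketch below is a simplified model, not Elasticsearch itself. It assumes the index-time analyzer reduces to lowercasing plus edge n-grams for these values, that the standard search analyzer turns the query into the single term proj1, that only f4 can match (f1 holds numbers), and that the filter keeps documents whose deleted value is "0". The helper name is hypothetical.

```python
def edge_ngrams(text, min_gram=2, max_gram=20):
    """Index-time model: lowercase, then emit prefixes of length min_gram..max_gram."""
    token = text.lower()
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


# The two documents indexed above, keyed by id.
docs = {
    1: {"f1": "396", "f4": "Proj12", "deleted": "0"},
    2: {"f1": "42", "f4": "Proj22", "deleted": "1"},
}

# Search-time model: the standard analyzer only lowercases this one-word query.
query_term = "proj1".lower()

matches = []
for doc_id, source in docs.items():
    indexed_terms = edge_ngrams(source["f4"])
    # A document is a hit when the query term is among its indexed grams
    # and it passes the deleted = "0" filter.
    if query_term in indexed_terms and source["deleted"] == "0":
        matches.append(doc_id)

print(edge_ngrams("Proj12"))  # ['pr', 'pro', 'proj', 'proj1', 'proj12']
print(edge_ngrams("Proj22"))  # ['pr', 'pro', 'proj', 'proj2', 'proj22']
print(matches)                # [1]
```

Document #2 would be excluded twice over in this model: none of its grams equals proj1, and its deleted value is "1".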

Hope this helps :)

Answer 2

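After spending hours looking for a solution, I finally got it working.

I kept everything the same as described in the question, using the n-gram analyzer when indexing the data. The only thing I had to change was to use the all field in the search query, together with the existing multi-match query, as a bool query.

Now a search for the text Proj1 returns results in an order such as Proj1, Proj121, Proj11, and so on.

This is not exactly the order I was after (Proj1, Proj11, Proj121, and so on), but it is still very close to the result I want.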
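The answer does not include the final query. As a rough illustration only, one shape such a request body could take is sketched below in Python. It assumes that "all field" refers to Elasticsearch's built-in _all field, that the existing multi_match stays as a required clause, and that the match on _all is added as an optional clause. None of this is confirmed by the answer, and the function name is hypothetical.

```python
import json


def build_query(text):
    """Build a bool query: required multi_match plus an optional match on _all."""
    return {
        "query": {
            "bool": {
                # Unchanged clause from the question; it must match.
                "must": {
                    "multi_match": {
                        "query": text,
                        "type": "cross_fields",
                        "operator": "and",
                        "fields": "f*",
                    }
                },
                # Assumed extra clause: an optional match on the _all field.
                "should": {"match": {"_all": text}},
            }
        }
    }


body = build_query("proj1")
print(json.dumps(body, indent=2))
```

If _all is analyzed without edge n-grams, the optional clause would match only the document whose value is exactly Proj1 and so raise its score, which would be consistent with Proj1 now appearing first.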

Answer 1: score 1, 2016-05-15 20:25:20

Answer 2: score 0, accepted, 2016-06-27 11:41:32
