
Elastic search cross fields, edge ngram analyzer

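I have 999 documents that I use for experimenting with Elasticsearch.

My type mapping has a field f4 that is analyzed, with the following analyzer settings: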

  "myNGramAnalyzer" => [
        "type" => "custom",
        "char_filter" => ["html_strip"],
        "tokenizer" => "standard",
        "filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"]
  ]

My filter is defined as follows:

  "filter" => [
        "ngram_filter" => [
            "type" => "edgeNGram",
            "min_gram" => "2",
            "max_gram" => "20"
        ]
  ]

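The values of field f4 are "Proj1", "Proj2", "Proj3", and so on.

Now, when I search for the string "proj1" using cross_fields, I expect the document with "Proj1" to come back at the top of the response with the highest score. That is not what happens. The rest of the data is almost identical across all documents.

I also don't understand why it matches all 999 documents.

Here is my search: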

{
    "index": "myindex",
    "type": "mytype",
    "body": {
        "query": {
            "multi_match": {
                "query": "proj1",
                "type": "cross_fields",
                "operator": "and",
                "fields": "f*"
            }
        },
        "filter": {
            "term": {
                "deleted": "0"
            }
        }
    }
}

My search result is:

{
    "took": 12,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 999,
        "max_score": 1,
        "hits": [{
            "_index": "myindex",
            "_type": "mytype",
            "_id": "42",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "125650","f3": "BH.1511AI.001",
                "f4": "Proj42",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        }, {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "47",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "137946","f3": "BH.152096.001",
                "f4": "Proj47",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        }, 
        //.......
        //.......
        //MANY RECORDS IN BETWEEN HERE
        //.......
        //.......
        {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "1",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "142095","f3": "BH.705215.001",
                "f4": "Proj1",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        //.......
        //.......
        //MANY RECORDS IN BETWEEN HERE
        //.......
        //.......
        }]
    }
}

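Am I doing something wrong, or am I missing something? (Apologies for the lengthy question, but I thought I should give all the relevant information and leave out the unrelated code.)

Edit:

Term vector response: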

{
    "_index": "myindex",
    "_type": "mytype",
    "_id": "10",
    "_version": 1,
    "found": true,
    "took": 9,
    "term_vectors": {
        "f4": {
            "field_statistics": {
                "sum_doc_freq": 5886,
                "doc_count": 999,
                "sum_ttf": 5886
            },
            "terms": {
                "pr": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "pro": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj1": {
                    "doc_freq": 111,
                    "ttf": 111,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj10": {
                    "doc_freq": 11,
                    "ttf": 11,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                }
            }
        }
    }
}
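
The term vector numbers above can be reproduced with a small simulation. This is a minimal sketch rather than Elasticsearch code: `edge_ngrams` is a hypothetical helper that mimics only the edge n-gram filter with `min_gram` 2 and `max_gram` 20, and it assumes that f4 holds exactly "Proj1" through "Proj999" and that the stop and snowball filters leave these tokens unchanged.

```python
from collections import Counter


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the leading substrings of token, from min_gram to max_gram characters."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# Assumption: the 999 documents hold "Proj1" .. "Proj999", lowercased by the analyzer.
values = [f"proj{i}" for i in range(1, 1000)]

# Count in how many documents each indexed term appears.
doc_freq = Counter()
for value in values:
    doc_freq.update(set(edge_ngrams(value)))

print(edge_ngrams("proj10"))    # ['pr', 'pro', 'proj', 'proj1', 'proj10']
print(doc_freq["pr"])           # 999
print(doc_freq["proj1"])        # 111
print(doc_freq["proj10"])       # 11
print(sum(doc_freq.values()))   # 5886
```

Under these assumptions the output agrees with the `doc_freq` values and with `sum_doc_freq` in the response: the short prefixes "pr", "pro" and "proj" are indexed for every document, while "proj1" is indexed for the 111 documents whose value starts with "Proj1".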

Edit 2:

Mapping of field f4:

"f4" : {
    "type" : "string",
    "index_analyzer" : "myNGramAnalyzer",
    "search_analyzer" : "standard"
}

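I have updated the mapping to use the standard analyzer at query time, which improved the results, but they are still not what I expect.

Instead of 999 (all documents), it now returns 111 documents, such as "Proj1", "Proj11", "Proj111" ... "Proj1", "Proj181" ... and so on.

"Proj1" still appears somewhere in the middle of the results rather than at the top.

There is no index_analyzer (at least not as of Elasticsearch version 1.7). As mapping parameters, you can use analyzer and search_analyzer. Try the following steps to make it work.

Create myindex with the analyzer settings: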

PUT /myindex
{
   "settings": {
     "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "myNGramAnalyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": "html_strip",
               "filter": [
                  "lowercase",
                  "standard",
                  "asciifolding",
                  "stop",
                  "snowball",
                  "ngram_filter"
               ]
            }
         }
      }
   }
}

Add the mapping to mytype (for brevity, I mapped only the relevant fields):

PUT /myindex/_mapping/mytype
{
   "properties": {
      "f1": {
         "type": "string"
      },
      "f4": {
         "type": "string",
         "analyzer": "myNGramAnalyzer",
         "search_analyzer": "standard"
      },
      "deleted": {
         "type": "string"
      }
   }
}
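
Why the `search_analyzer` in this mapping matters can be illustrated with a rough simulation. It is a sketch under stated assumptions, not Elasticsearch itself: it assumes the documents are "Proj1" through "Proj999", it uses a hypothetical `edge_ngrams` helper in place of the real filter, and it assumes that query terms produced at the same token position are treated as alternatives, so that a document matches if it contains any one of them even with `"operator": "and"`.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the set of leading substrings of token, from min_gram to max_gram characters."""
    upper = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, upper + 1)}


# Assumption: 999 documents whose f4 values are "Proj1" .. "Proj999".
index = {i: edge_ngrams(f"proj{i}") for i in range(1, 1000)}


def search(query_terms):
    """Return ids of documents that share at least one indexed term with the query."""
    return [doc_id for doc_id, terms in index.items() if terms & query_terms]


# Query analyzed with the n-gram analyzer: "pr", "pro", "proj", "proj1".
ngram_query = edge_ngrams("proj1")
# Query analyzed with the standard analyzer: the single term "proj1".
standard_query = {"proj1"}

print(len(search(ngram_query)))     # 999
print(len(search(standard_query)))  # 111
```

With the n-gram analyzer applied to the query, the prefix "pr" alone is enough to match every document. With the standard analyzer, only documents that indexed the term "proj1" match, which corresponds to the 111 hits reported in the question.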

Index some data:

PUT myindex/mytype/1
{
    "f1":"396",
    "f4":"Proj12" ,
    "deleted": "0"
}

PUT myindex/mytype/2
{
    "f1":"42",
    "f4":"Proj22" ,
    "deleted": "1"
}

Now try the query:

GET myindex/mytype/_search
{
   "query": {
      "multi_match": {
         "query": "proj1",
         "type": "cross_fields",
         "operator": "and",
         "fields": "f*"
      }
   },
   "filter": {
      "term": {
         "deleted": "0"
      }
   }
}

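It should return document #1. It works for me in Sense. I am using Elasticsearch 2.X.

Hope this helps :)

After spending hours looking for a solution, I finally got it working.

I kept everything as described in the question, including the n-gram analyzer used when indexing the data. The only thing I had to change was to use the _all field in the search, combined with the existing multi_match query in a bool query.

Now a search for Proj1 returns results in an order such as Proj1, Proj121, Proj11, and so on.

This is not the exact order Proj1, Proj11, Proj121 that I wanted, but it is still very close to it.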
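
The answer above does not show its final query, so the following is only a guess at its shape. It is a minimal sketch with a hypothetical `build_query` helper that assumes the original multi_match clause is kept as a required clause and a match on `_all` is added as an optional clause that raises the score of documents containing the whole term. The top-level `filter` is copied unchanged from the question.

```python
import json


def build_query(text):
    """Build a search body combining multi_match with a match on _all (hypothetical reconstruction)."""
    return {
        "query": {
            "bool": {
                # The original query from the question, unchanged.
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            "type": "cross_fields",
                            "operator": "and",
                            "fields": "f*",
                        }
                    }
                ],
                # Assumption: _all acts as an optional clause that boosts exact term matches.
                "should": [{"match": {"_all": text}}],
            }
        },
        "filter": {"term": {"deleted": "0"}},
    }


print(json.dumps(build_query("proj1"), indent=4))
```

If `_all` is analyzed with the default standard analyzer, a document whose value is exactly "Proj1" would satisfy both clauses while "Proj11" or "Proj121" would satisfy only the first, which would be consistent with "Proj1" moving to the top.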
