Partial word search not working in elasticsearch (elasticsearch-py) using mongo-connector

I have indexed my MongoDB collection into Elasticsearch, which runs in a Docker container. I can query a document by its exact name, but Elasticsearch finds no match when the query is only part of the name. Here is an example:

>>> es = Elasticsearch('0.0.0.0:9200')
>>> es.indices.get_alias('*')
{'mongodb_meta': {'aliases': {}}, 'sigstore': {'aliases': {}}, 'my-index': {'aliases': {}}}
>>> x = es.search(index='sigstore', body={'query': {'match': {'name': 'KEGG_GLYCOLYSIS_GLUCONEOGENESIS'}}})
>>> x
{'took': 198, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 1, 'relation': 'eq'}, 'max_score': 8.062855, 'hits': [{'_index': 'sigstore', '_type': 'sigs', '_id': '5d66c23228144432307c2c49', '_score': 8.062855, '_source': {'id': 1, 'name': 'KEGG_GLYCOLYSIS_GLUCONEOGENESIS', 'description': 'http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GLYCOLYSIS_GLUCONEOGENESIS', 'members': ['ACSS2', 'GCK', 'PGK2', 'PGK1', 'PDHB', 'PDHA1', 'PDHA2', 'PGM2', 'TPI1', 'ACSS1', 'FBP1', 'ADH1B', 'HK2', 'ADH1C', 'HK1', 'HK3', 'ADH4', 'PGAM2', 'ADH5', 'PGAM1', 'ADH1A', 'ALDOC', 'ALDH7A1', 'LDHAL6B', 'PKLR', 'LDHAL6A', 'ENO1', 'PKM2', 'PFKP', 'BPGM', 'PCK2', 'PCK1', 'ALDH1B1', 'ALDH2', 'ALDH3A1', 'AKR1A1', 'FBP2', 'PFKM', 'PFKL', 'LDHC', 'GAPDH', 'ENO3', 'ENO2', 'PGAM4', 'ADH7', 'ADH6', 'LDHB', 'ALDH1A3', 'ALDH3B1', 'ALDH3B2', 'ALDH9A1', 'ALDH3A2', 'GALM', 'ALDOA', 'DLD', 'DLAT', 'ALDOB', 'G6PC2', 'LDHA', 'G6PC', 'PGM1', 'GPI'], 'user': 'naji.taleb@medimmune.com', 'type': 'public', 'level1': 'test', 'level2': 'test2', 'time': '08-28-2019 14:03:29 EDT-0400', 'source': 'File', 'mapped': [''], 'notmapped': [''], 'organism': 'human'}}]}}

With the full name of the document, Elasticsearch finds it. But this is what happens when I search for part of the name or use a wildcard:

>>> x = es.search(index='sigstore', body={'query': {'match': {'name': 'KEGG'}}})
>>> x
{'took': 17, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 0, 'relation': 'eq'}, 'max_score': None, 'hits': []}}



>>> x = es.search(index='sigstore', body={'query': {'match': {'name': 'KEGG*'}}})
>>> x
{'took': 3, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 0, 'relation': 'eq'}, 'max_score': None, 'hits': []}}
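
One possible reading of the two empty results above: a `match` query analyzes its input but does not treat `*` as a wildcard, and the default standard analyzer most likely keeps `KEGG_GLYCOLYSIS_GLUCONEOGENESIS` as a single lowercased token, because underscores are not word boundaries for it. A minimal sketch of term-level alternatives follows. It assumes `name` was dynamically mapped as `text` with a `name.keyword` sub-field (an assumption that `es.indices.get_mapping` can confirm), and the helper names are hypothetical.

```python
def prefix_query(field, prefix):
    """Build a body for a term-level prefix query (the input is not analyzed)."""
    return {"query": {"prefix": {field: {"value": prefix}}}}


def wildcard_query(field, pattern):
    """Build a body for a term-level wildcard query; '*' matches any characters."""
    return {"query": {"wildcard": {field: {"value": pattern}}}}


# Assumption: the keyword sub-field stores the original casing, so keep the case.
body_keyword = prefix_query("name.keyword", "KEGG")

# Assumption: the analyzed text field stores lowercased tokens, so lowercase the pattern.
body_text = wildcard_query("name", "KEGG*".lower())

# With a live client, either body could be passed as, for example:
# es.search(index="sigstore", body=body_keyword)
```

Both query types scan the term dictionary, so they can be slow on large indices; an n-gram field moves that cost to index time instead.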

Besides the default index settings, I also tried creating an index that uses an nGram token filter to enable partial search, but that didn't work either. These are the settings I used for that index:

{
  "sigstore": {
    "aliases": {},
    "mappings": {},
    "settings": {
      "index": {
        "max_ngram_diff": "99",
        "number_of_shards": "1",
        "provided_name": "sigstore",
        "creation_date": "1579200699718",
        "analysis": {
          "filter": {
            "substring": {
              "type": "nGram",
              "min_gram": "1",
              "max_gram": "20"
            }
          },
          "analyzer": {
            "str_index_analyzer": {
              "filter": [
                "lowercase",
                "substring"
              ],
              "tokenizer": "keyword"
            },
            "str_search_analyzer": {
              "filter": [
                "lowercase"
              ],
              "tokenizer": "keyword"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "3nf915U6T9maLdSiJozvGA",
        "version": {
          "created": "7050199"
        }
      }
    }
  }
}

and this is the corresponding Python command that created it:

es.indices.create(index='sigstore',body={"mappings": {},"settings": { 'index': { "analysis": {"analyzer": {"str_search_analyzer": {"tokenizer": "keyword","filter": ["lowercase"]},"str_index_analyzer": {"tokenizer": "keyword","filter": ["lowercase", "substring"]}},"filter": {"substring": {"type": "nGram","min_gram": 1,"max_gram": 20}}}},'max_ngram_diff': '99'}})
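
To see what the two custom analyzers are meant to do, here is a pure-Python approximation (not Elasticsearch itself) of `str_index_analyzer` and `str_search_analyzer` as configured above. The function names mirror the analyzer names, and the n-gram logic is a simplified stand-in for the real token filter. Note that these analyzers only take effect once a field mapping refers to them, and the index above was created with empty `mappings`.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of text whose length is in [min_gram, max_gram]."""
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.add(text[start:start + size])
    return grams


def str_index_analyzer(value):
    """Approximate keyword tokenizer + lowercase + nGram filter (min 1, max 20)."""
    return ngrams(value.lower(), 1, 20)


def str_search_analyzer(value):
    """Approximate keyword tokenizer + lowercase: the whole input is one token."""
    return {value.lower()}


indexed = str_index_analyzer("KEGG_GLYCOLYSIS_GLUCONEOGENESIS")

# A partial query matches when its single search token is among the indexed grams.
partial_matches = str_search_analyzer("KEGG") <= indexed

# The full name is longer than max_gram (20), so no indexed gram equals it.
full_matches = str_search_analyzer("KEGG_GLYCOLYSIS_GLUCONEOGENESIS") <= indexed
```

In this approximation the partial query matches and the full name does not, which suggests keeping an exact-match field alongside the n-gram field.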

I use mongo-connector as the pipeline between my MongoDB collection and Elasticsearch. This is the command I use to start it:

mongo-connector -m mongodb://username:password@xx.xx.xxx.xx:27017/?authSource=admin -t elasticsearch:9200 -d elastic2_doc_manager -n sigstore.sigs

I'm unsure why Elasticsearch can't return a partial match, and I wonder whether I'm missing a setting or have made a crucial mistake somewhere. Thanks for reading.

Versions

MongoDB 4.0.10

elasticsearch==7.1.0

elastic2-doc-manager[elastic5]

Updated after checking your gist:

You need to apply the mapping to your field as described in the docs (see the first link I shared in the comments).

Do this after applying the settings to your index; according to the gist, that is line 11.

Something like:

PUT /your_index/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "ignore_above": 256,
      "fields": {
        "str_search_analyzer": {
          "type": "text",
          "analyzer": "str_search_analyzer"
        }
      }
    }
  }
}
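
For elasticsearch-py users, the same request could be sketched as below. This variant differs from the mapping above in one labeled way: it indexes the sub-field with `str_index_analyzer` and searches it with `str_search_analyzer`, on the assumption that the n-grams have to be produced at index time for partial input to match. The helper names are hypothetical, and the `put_mapping` call assumes a 7.x client and an index where `name` has not already been mapped with a different type.

```python
def name_mapping(index_analyzer="str_index_analyzer",
                 search_analyzer="str_search_analyzer"):
    """Build the mapping body: exact-match keyword field plus an n-gram sub-field."""
    return {
        "properties": {
            "name": {
                "type": "keyword",
                "ignore_above": 256,
                "fields": {
                    # Sub-field name kept from the mapping above.
                    "str_search_analyzer": {
                        "type": "text",
                        # Assumption: n-grams are generated when documents are indexed...
                        "analyzer": index_analyzer,
                        # ...while the query text is only lowercased.
                        "search_analyzer": search_analyzer,
                    }
                },
            }
        }
    }


def apply_name_mapping(client, index="sigstore"):
    """Send the mapping through an elasticsearch-py client passed in by the caller."""
    return client.indices.put_mapping(index=index, body=name_mapping())
```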

After you set the mapping, you need to apply it to your existing documents using update_by_query:

https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-update-by-query.html
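
With elasticsearch-py, that step might look like the sketch below. The helper name `refresh_documents` is hypothetical; it assumes a 7.x client object is passed in, and it sends no query body so that every document in the index is re-indexed and picks up the new mapping.

```python
def refresh_documents(client, index="sigstore"):
    """Re-index all documents in place so new sub-fields get populated.

    conflicts="proceed" asks Elasticsearch to skip documents that changed
    during the run instead of aborting the whole operation.
    """
    return client.update_by_query(index=index, conflicts="proceed")
```

If mongo-connector is still writing to the index while this runs, version conflicts are likely, which is the reason for `conflicts="proceed"` here.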

You can then keep using a term query on the field name, since it is indexed with a keyword mapping (exact match), and query the sub-field name.str_search_analyzer with part of the word.

# Either the full name or part of it, e.g. 'KEGG_GLYCOLYSIS_GLUCONEOGENESIS'
your_keyword = 'KEGG'

x = es.search(index='sigstore', body={'query': {'bool': {'should': [
    {'term': {'name': your_keyword}},
    {'match': {'name.str_search_analyzer': your_keyword}},
]}}})
