
How to customize a synonym token filter with ElasticSearch-dsl in Python?

I am trying to build a synonym token filter with ElasticSearch-dsl in Python so that, for example, searching for "tiny" or "little" also returns articles containing "small". Here is my code:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import analyzer, connections, token_filter

# Connect to local host server
connections.create_connection(hosts=['127.0.0.1'])

# Synonym token filter backed by the WordNet prolog file
spelling_tokenfilter = token_filter(
    'my_tokenfilter',  # Name for the filter
    'synonym',         # Synonym filter type
    synonyms_path="analysis/wn_s.pl"
)

# Create elasticsearch object
es = Elasticsearch()

text_analyzer = analyzer('my_tokenfilter',
                         type='custom',
                         tokenizer='standard',
                         filter=['lowercase', 'stop', spelling_tokenfilter])

I created a folder named "analysis" under es-7.6.2/config, downloaded the WordNet prolog database, and copied "wn_s.pl" into it. But when I run the program, I get this error:

Traceback (most recent call last):
  File "index.py", line 161, in <module>
    main()
  File "index.py", line 156, in main
    buildIndex()
  File "index.py", line 74, in buildIndex
    covid_index.create()
  File "C:\Anaconda\lib\site-packages\elasticsearch_dsl\index.py", line 259, in create
    return self._get_connection(using).indices.create(index=self._name, body=self.to_dict(), **kwargs)
  File "C:\Anaconda\lib\site-packages\elasticsearch\client\utils.py", line 92, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "C:\Anaconda\lib\site-packages\elasticsearch\client\indices.py", line 104, in create
    "PUT", _make_path(index), params=params, headers=headers, body=body
  File "C:\Anaconda\lib\site-packages\elasticsearch\transport.py", line 362, in perform_request
    timeout=timeout,
  File "C:\Anaconda\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 248, in perform_request
    self._raise_error(response.status, raw_data)
  File "C:\Anaconda\lib\site-packages\elasticsearch\connection\base.py", line 244, in _raise_error
    status_code, error_message, additional_info
elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'failed to build synonyms')

Does anyone know how to fix this? Thanks!

It looks like this happens because you defined the lowercase and stop token filters before the synonym filter (docs):

Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse entries in a synonym file. So, for example, if a synonym filter is placed after a stemmer, then the stemmer will also be applied to the synonym entries.

First, let's get more detail about the error by catching the exception:

>>> text_analyzer = analyzer('my_tokenfilter',
...                          type='custom',
...                          tokenizer='standard',
...                          filter=[
...                              'lowercase', 'stop',
...                              spelling_tokenfilter
...                              ])
>>>
>>> try:
...   text_analyzer.simulate('blah blah')
... except Exception as e:
...   ex = e
...
>>> ex
RequestError(400, 'illegal_argument_exception', {'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'failed to build synonyms'}], 'type': 'illegal_argument_exception', 'reason': 'failed to build synonyms', 'caused_by': {'type': 'parse_exception', 'reason': 'Invalid synonym rule at line 109', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'term: course of action analyzed to a token (action) with position increment != 1 (got: 2)'}}}, 'status': 400})

This part in particular is interesting:

'reason': 'Invalid synonym rule at line 109', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'term: course of action analyzed to a token (action) with position increment != 1 (got: 2)'}

This suggests that it managed to find the file, but failed to parse it: the stop filter removes "of" from the WordNet term "course of action", leaving a position gap that the synonym parser rejects. You can reproduce that in isolation, as the sketch below shows.
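A minimal sketch of that check, assuming a local cluster on 127.0.0.1 (the probe analyzer name is ours):

from elasticsearch_dsl import analyzer, connections

connections.create_connection(hosts=['127.0.0.1'])

# The same two filters that precede the synonym filter in the failing analyzer
probe = analyzer('probe',
                 type='custom',
                 tokenizer='standard',
                 filter=['lowercase', 'stop'])

for t in probe.simulate('course of action').tokens:
    print(t.token, t.position)

# Expected output: "of" is dropped by the stop filter, so the position
# jumps from 0 straight to 2, exactly the increment the error complains about:
#   course 0
#   action 2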

最后,如果您刪除這兩個令牌過濾器,錯誤就會消失:

text_analyzer = analyzer('my_tokenfilter',
                         type='custom',
                         tokenizer='standard',
                         filter=[
                             #'lowercase', 'stop',
                             spelling_tokenfilter
                             ])
...
>>> text_analyzer.simulate("blah")
{'tokens': [{'token': 'blah', 'start_offset': 0, 'end_offset...}
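
To put the analyzer to use, you can attach it to a field mapping so the index is created with the custom analysis settings; a hypothetical sketch (the Article class and the covid index name are ours, chosen to match the covid_index in your traceback):

from elasticsearch_dsl import Document, Text

class Article(Document):
    # The custom analyzer is applied to this field at index time,
    # so synonym expansion happens as documents are indexed
    body = Text(analyzer=text_analyzer)

    class Index:
        name = 'covid'  # hypothetical index name

# Creates the index together with the analyzer and filter definitions
Article.init()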

The docs suggest using the multiplexer token filter in case you need to combine these; see the sketch below.
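For instance, something like the following, assuming your elasticsearch-dsl version can expand nested filter lists inside a multiplexer (the filter and analyzer names are ours). Each branch is applied to every incoming token, and the synonym file is parsed using only the filters in its own branch, so the stop filter no longer touches the synonym entries:

from elasticsearch_dsl import analyzer, token_filter

synonyms = token_filter('my_synonyms', 'synonym',
                        synonyms_path='analysis/wn_s.pl')

mplex = token_filter(
    'my_multiplexer',
    type='multiplexer',
    filters=[
        ['lowercase', 'stop'],    # branch 1: lowercased tokens minus stopwords
        ['lowercase', synonyms],  # branch 2: lowercased tokens plus synonyms
    ],
)

combined_analyzer = analyzer('my_combined_analyzer',
                             type='custom',
                             tokenizer='standard',
                             filter=[mplex])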

Hope this helps!
