Why can't my query using a MinHash analyzer retrieve duplicates?

Question

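I am trying to query an Elasticsearch index for near-duplicates using its MinHash implementation. I use the Python client, running in containers, to index and perform the search.

My corpus is a JSONL file that looks a bit like this: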

{"id":1, "text":"I'd just like to interject for a moment"}
{"id":2, "text":"I come up here for perception and clarity"}
...

I successfully create an Elasticsearch index with custom settings and a custom analyzer, taking inspiration from the official examples and the MinHash docs:

def create_index(client):
    client.indices.create(
        index="documents",
        body={
            "settings": {
                "analysis": {
                    "filter": {
                        "my_shingle_filter": {      
                        "type": "shingle",
                        "min_shingle_size": 5,
                        "max_shingle_size": 5,
                        "output_unigrams": False
                        },
                        "my_minhash_filter": {
                        "type": "min_hash",
                        "hash_count": 10,          
                        "bucket_count": 512,      
                        "hash_set_size": 1,       
                        "with_rotation": True     
                        }
                    },
                    "analyzer": {
                        "my_analyzer": {
                        "tokenizer": "standard",
                        "filter": [
                            "my_shingle_filter",
                            "my_minhash_filter"
                        ]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "name": {"type": "text", "analyzer": "my_analyzer"}
                }
            },
        },
        ignore=400,
    )

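I verified via Kibana that the index was created without major problems, and visiting http://localhost:9200/documents/_settings returns something that looks in order.

However, when I query the index with: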

def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['_id', 'body'],
        'size': K,
        'query': {
            "match": {
                "body": {
                    "query": body,
                    "analyzer" : "my_analyzer"
                }
            }
        }
    }

    res = es.search(index='documents', body=doc)
    top_matches = [hit['_source']['_id'] for hit in res['hits']['hits']]

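my res['hits'] is consistently empty, even if I set body to exactly match the text of one of the entries in my corpus. In other words, I get no results if I try as the value of body, for example,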

"I come up here for perception and clarity"

or a substring such as

"I come up here for perception"

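Ideally, I would like the procedure to return near-duplicates, with a score that approximates the Jaccard similarity between the query and each near-duplicate, as obtained via MinHash.

Is there something wrong in my query and/or in the way I index into Elasticsearch? Am I missing something else entirely?

PS: You can have a look at https://github.com/davidefiocco/dockerized-elasticsearch-duplicate-finder/tree/ea0974363b945bf5f85d52a781463fba76f4f987 for a non-functional, but hopefully reproducible, example (I will also update the repo as I find a solution!)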

Answer 1

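Here are some things you should double-check, as they are likely culprits:

When you create the mapping, change "name" to "text" in the body parameter of the client.indices.create call, because your JSON documents have a field called text:
```
"mappings": {
    "properties": {
        "text": {"type": "text", "analyzer": "my_analyzer"}
    }
}
```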
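To make the field-name mismatch easy to spot, here is a minimal sketch in plain Python. The helper name `unmapped_fields` is hypothetical (it is not part of the Elasticsearch client); it only compares the keys of a sample document with the `properties` of a mapping and needs no running cluster.

```python
def unmapped_fields(sample_doc, mappings):
    """Return the document fields that have no explicit entry in the mapping."""
    mapped = set(mappings.get("properties", {}))
    return sorted(set(sample_doc) - mapped)


sample = {"id": 1, "text": "I'd just like to interject for a moment"}

# Mapping from the question: the analyzer is attached to "name".
original = {"properties": {"name": {"type": "text", "analyzer": "my_analyzer"}}}
# Mapping suggested above: the analyzer is attached to "text".
corrected = {"properties": {"text": {"type": "text", "analyzer": "my_analyzer"}}}

print(unmapped_fields(sample, original))   # "text" is listed: my_analyzer never sees it
print(unmapped_fields(sample, corrected))  # only "id" is left unmapped
```

With the original mapping, the text field is not covered by my_analyzer at all, which is consistent with the query returning nothing useful.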
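In the indexing phase, you could also modify your generate_actions() method following the documentation, with something like:
```
# inside generate_actions()
for elem in corpus:
    yield {
        "_op_type": "index",
        "_index": "documents",
        "_id": elem["id"],
        # the document body must be a JSON object, so the text is wrapped in its field
        "_source": {"text": elem["text"]},
    }
```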
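Incidentally, if you are indexing pandas dataframes, you may want to check out the experimental official library eland.
Also, according to your mapping, you are using a min_hash token filter, so Lucene will transform the text in your text field into hashes. This means you can query this field with a hash, not with a string such as "I come up here for perception and clarity" as you did in your example. The best way to use it is therefore to retrieve the content of the text field and then query Elasticsearch with that same retrieved value.
Finally, the _id metafield is not inside the _source metafield, so you should change your get_duplicate_documents() method to: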
```
def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['text'],
        'size': K,
        'query': {
            "match": {
                "text": {  # I changed this line!
                    "query": body
                }
            }
        }
    }

    res = es.search(index='documents', body=doc)
    # also changed the list comprehension!
    top_matches = [(hit['_id'], hit['_source']) for hit in res['hits']['hits']]
    return top_matches
```
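As a rough illustration of the score the question is hoping for, the sketch below computes the exact Jaccard similarity between the 5-word shingle sets of the two example strings, which is the quantity MinHash approximates. It is plain Python, not Elasticsearch code: the function names are hypothetical, and splitting on whitespace is a simplifying assumption that only loosely mimics the standard tokenizer followed by the shingle filter.

```python
def word_shingles(text, size=5):
    """Return the set of `size`-word shingles of `text` (whitespace-split)."""
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets; taken as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


full = word_shingles("I come up here for perception and clarity")
part = word_shingles("I come up here for perception")

print(len(full), len(part))  # number of shingles in each string
print(jaccard(full, part))   # share of shingles the two strings have in common
```

Note that a text with fewer than five words yields no shingles at all under these settings, so very short queries have nothing to be hashed.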

Solution 1
1 accepted 2020-08-03 14:21:44
