Locality-sensitive hashing - Elasticsearch

Is there any plugin that enables LSH on Elasticsearch? If so, could you point me to it and explain briefly how to use it? Thanks.

Edit: I found out that there is a MinHash plugin for ES. How can I use it to compare documents with one another? What would be a good setting for finding duplicates?

  1. There is an Elasticsearch MinHash plugin. You can use it to compute a minhash value each time you index a document, and later query documents by that value.

    1. Install the MinHash plugin:

       $ $ES_HOME/bin/plugin install org.codelibs/elasticsearch-minhash/2.3.1 
    2. Add a minhash analyzer when creating your index:

       $ curl -XPUT 'localhost:9200/my_index' -d '{
           "index": {
             "analysis": {
               "analyzer": {
                 "minhash_analyzer": {
                   "type": "custom",
                   "tokenizer": "standard",
                   "filter": ["minhash"]
                 }
               }
             }
           }
         }'
    3. Add a minhash_value field to the index mapping:

       $ curl -XPUT "localhost:9200/my_index/my_type/_mapping" -d '{
           "my_type": {
             "properties": {
               "message": {
                 "type": "string",
                 "copy_to": "minhash_value"
               },
               "minhash_value": {
                 "type": "minhash",
                 "minhash_analyzer": "minhash_analyzer"
               }
             }
           }
         }'
    4. The minhash value is computed automatically when you add a document to the index you created with the minhash analyzer.
    5. a. A More Like This query can be used to do a "like" search on the minhash_value field:

       GET /_search
       {
         "query": {
           "more_like_this": {
             "fields": ["minhash_value"],
             "like": "KV5rsUfZpcZdVojpG8mHLA==",
             "min_term_freq": 1,
             "max_query_terms": 12
           }
         }
       }

      b. You can also use a fuzzy query, but it only matches values that differ from the query by an edit distance of at most 2.

       GET /_search
       {
         "query": {
           "fuzzy": {
             "minhash_value": "KV5rsUfZpcZdVojpG8mHLA=="
           }
         }
       }

      More details on the fuzzy query are in the Elasticsearch documentation.

  2. Or you can compute the hash value outside of Elasticsearch: write code that extracts the hash, run it every time you index a document, and attach the resulting value to the document. Later, search on the hash value with a More Like This query or a fuzzy query as described above.
  3. Last but not least, you can write an Elasticsearch plugin yourself that implements whichever hashing algorithm suits you, and then follow the same steps as above.
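
For option 2, here is a minimal sketch of computing a MinHash signature outside of Elasticsearch and comparing two documents with it. It is only an illustration: the names `shingles`, `minhash_signature`, `estimated_jaccard` and `NUM_HASHES`, the 3-word shingles and the 128-hash signature length are all my own assumptions, and the output is not compatible with the values the MinHash plugin stores. The fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib
import re

NUM_HASHES = 128  # signature length; an illustrative value, not a recommended setting


def shingles(text, k=3):
    """Return the set of k-word shingles of the lower-cased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: fall back to the whole text as one token.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """64-bit hash of a token; the seed is used as a salt to get a distinct hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=NUM_HASHES, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    tokens = shingles(text, k)
    if not tokens:
        raise ValueError("cannot compute a MinHash signature for empty text")
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river bank today"
    sig_a = minhash_signature(doc_a)
    sig_b = minhash_signature(doc_b)
    print("estimated similarity:", estimated_jaccard(sig_a, sig_b))
```

What counts as a duplicate is a threshold you choose on the estimated similarity. A longer signature gives a less noisy estimate at the cost of more storage per document. To search from Elasticsearch you would still need to store some encoding of the signature in a field of the document, as described in option 2.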
