How to generate an _id like Elasticsearch's, but for Apache Lucene?

Question

I want to generate document _id values in Apache Lucene the same way Elasticsearch does, so that my Lucene documents get Elasticsearch-style _id values. How can I do this? Where can I find the algorithm that generates the Elasticsearch _id?

Answer 1

The algorithm is based on Flake IDs and can be found here: https://github.com/elastic/elasticsearch/blob/be7c7415627377a1b795400fb8dfcc6cbdf0e322/server/src/main/java/org/elasticsearch/common/TimeBasedUUIDGenerator.java#L49
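
To make the Flake ID idea concrete, here is a minimal sketch of a Flake-style generator in plain Java. It is not the Elasticsearch code from the link above. The class name FlakeStyleIdGenerator, the 6-byte timestamp / 6-byte node id / 3-byte sequence layout, and the random node id are all assumptions made for illustration.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical Flake-style ID generator; a sketch, not Elasticsearch's implementation. */
final class FlakeStyleIdGenerator {
    // Assumption: a random 6-byte value stands in for a machine identifier.
    private static final byte[] NODE_ID = new byte[6];
    private static final AtomicInteger SEQUENCE;

    static {
        SecureRandom random = new SecureRandom();
        random.nextBytes(NODE_ID);
        // Start the counter at a random offset.
        SEQUENCE = new AtomicInteger(random.nextInt());
    }

    /** Returns a 20-character, URL-safe Base64 string built from 15 bytes. */
    static String nextId() {
        long timestamp = System.currentTimeMillis();
        // Keep only the low 3 bytes of the counter.
        int sequence = SEQUENCE.incrementAndGet() & 0xFFFFFF;

        byte[] bytes = new byte[15];
        // Bytes 0-5: low 48 bits of the timestamp, big-endian.
        for (int i = 0; i < 6; i++) {
            bytes[i] = (byte) (timestamp >>> (8 * (5 - i)));
        }
        // Bytes 6-11: node id.
        System.arraycopy(NODE_ID, 0, bytes, 6, 6);
        // Bytes 12-14: sequence number, big-endian.
        bytes[12] = (byte) (sequence >>> 16);
        bytes[13] = (byte) (sequence >>> 8);
        bytes[14] = (byte) sequence;

        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }
}
```

Calling FlakeStyleIdGenerator.nextId() once per document gives a string you can store in an ID field. The sketch does not guard against the system clock moving backwards, which a production generator would need to handle.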

Answer 2

Apache Lucene doesn't have a direct equivalent to the "_id" field in Elasticsearch, but you can simulate this behavior by storing a unique identifier as a field in your Lucene document.

One way to generate a unique identifier is to use a UUID. You can use Java's java.util.UUID class to generate a unique identifier for each document and store it as a field in the Lucene document.

Another way is to use a hash of your document as the identifier. You can hash the contents of the document with an algorithm like SHA-256 and store the resulting hash value in the identifier field.

Note that an auto-generated Elasticsearch _id is unique but not deterministic: it comes from a time-based generator (see Answer 1), so indexing the same document twice yields two different IDs. If you want an identifier in Apache Lucene that is deterministic, you need to derive it from the document with a specific, deterministic algorithm.

Here is an example in Java that generates a unique identifier for each document using java.util.UUID:

import java.util.UUID;
import org.apache.lucene.document.Document;

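public class DocumentIDGenerator {
    // The document parameter is not used: the ID is random and
    // independent of the document's contents.
    public static String generateID(Document document) {
        return UUID.randomUUID().toString();
    }
}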

In this example, the generateID method takes a Lucene Document object as input and returns a newly generated UUID as a string. To use it, call the method for each document before adding the document to the index, and store the returned identifier as a field in the document.

Note that a random UUID is not deterministic, so this method is not suitable if you need a deterministic identifier. In that case, consider using a hash value as described in the next example.

Here is an example in Java that generates a deterministic identifier for each document using its SHA-256 hash:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;

public class DocumentIDGenerator {
    public static String generateID(Document document) {
        // Concatenate the string values of all fields, in field order.
        StringBuilder sb = new StringBuilder();
        for (IndexableField field : document.getFields()) {
            String value = field.stringValue();
            // stringValue() is null for fields without a string value; skip those.
            if (value != null) {
                sb.append(value);
            }
        }
        String contents = sb.toString();
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(contents.getBytes(StandardCharsets.UTF_8));
            // Render the 32-byte hash as a 64-character uppercase hex string.
            StringBuilder hexString = new StringBuilder();
            for (byte b : hash) {
                hexString.append(String.format("%02X", b));
            }
            return hexString.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}

In this example, the generateID method takes a Lucene Document object as input and returns the SHA-256 hash of the document's contents as a hexadecimal string. To use it, call the method for each document before adding the document to the index, and store the returned identifier as a field in the document.
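
One caveat with concatenating field values directly, as the example above does, is that different documents can produce the same input string: the values "ab" and "c" concatenate to the same text as "a" and "bc". A possible way around this, sketched below in plain Java, is to feed each value into the digest prefixed with its byte length. The class name LengthPrefixedHasher and the use of a plain list of strings in place of Lucene fields are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical helper: hashes a list of values without boundary ambiguity. */
final class LengthPrefixedHasher {
    /** Hashes the values in order, prefixing each with its 4-byte length. */
    static String hash(List<String> values) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String value : values) {
                byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
                // The length prefix marks where each value starts and ends.
                digest.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                digest.update(bytes);
            }
            // Render the hash as an uppercase hex string.
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02X", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With Lucene, you would collect the field values into the list before calling hash. The result still depends on the order of the values, so the fields need to be added in a consistent order.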

This is just one example; there are many other ways to generate a deterministic identifier. Choose the method that best fits your use case and requirements.
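
As one such alternative, java.util.UUID can also produce name-based (type 3) UUIDs, which are deterministic for a given input. The sketch below assumes each document has some natural key, such as a URL or a database primary key. The class name NameBasedIdGenerator and the naturalKey parameter are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical helper: derives a deterministic UUID from a natural key. */
final class NameBasedIdGenerator {
    /** The same key always yields the same type 3 (MD5-based) UUID string. */
    static String generateID(String naturalKey) {
        byte[] keyBytes = naturalKey.getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(keyBytes).toString();
    }
}
```

Because the ID depends only on the key, re-indexing a document with the same key reproduces the same identifier, which makes it possible to find and replace the earlier copy.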

Hope this helps.

2 answers

Answer 1
2 2023-02-01 10:05:21

Answer 2
0 accepted 2023-02-01 10:06:35