How do I get a list of all tokens from a Lucene 8.6.1 index?

Question

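I have looked at "How to get a list of all tokens from Solr/Lucene index?", but Lucene 8.6.1 does not seem to offer IndexReader.terms(). Has it been moved or replaced? Is there an easier way than this answer?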

Answer 1

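Some History

You asked: I'm just wondering if IndexReader.terms() has moved or been replaced by an alternative.

The Lucene v3 method IndexReader.terms() was moved to AtomicReader in Lucene v4. This is documented in the v4 alpha release notes.

(Bear in mind that Lucene v4 was released back in 2012.)

The method in AtomicReader in v4 takes a field name.

As the v4 release notes state:

One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (wraps a byte[]) per term within a single field, not a Term.

The key part there is "per term within a single field". From that point onward, there was no longer a single API call that retrieves all terms in an index.

This approach has carried through to later releases, except that the AtomicReader and AtomicReaderContext classes were renamed to LeafReader and LeafReaderContext in Lucene v5.0.0. See LUCENE-5569.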

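Recent Releases

That leaves us with the ability to access lists of terms, but only on a per-field basis:

The following code is based on the latest release of Lucene (8.7.0), but should also hold true for the version you mention (8.6.1). The example uses Java: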

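import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Prints the terms of one field, segment (leaf) by segment.
// A term that occurs in several segments is printed once per segment.
private void getTokensForField(IndexReader reader, String fieldName) throws IOException {
    List<LeafReaderContext> list = reader.leaves();

    for (LeafReaderContext lrc : list) {
        Terms terms = lrc.reader().terms(fieldName);
        // terms is null when this segment has no terms for the field
        if (terms != null) {
            TermsEnum termsEnum = terms.iterator();

            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                System.out.println(term.utf8ToString());
            }
        }
    }
}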

The above example assumes an index opened as follows:

private static final String INDEX_PATH = "/path/to/index/directory";
...
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));

If you need to enumerate field names, the code in this question may provide a starting point.
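As a rough sketch of how the per-field approach could be extended to every field, the example below reads the field names from each leaf's FieldInfos and collects the terms into one sorted set per field, which also removes duplicates across segments. The class name AllTermsSketch, the methods collectTerms and buildDemoIndex, and the field names "title" and "body" are made up for illustration. It assumes lucene-core 8.x on the classpath, where ByteBuffersDirectory and StandardAnalyzer are available, and it has only been reasoned through against that API, not taken from the linked question.

```java
import java.io.IOException;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, not part of Lucene.
final class AllTermsSketch {

    // Collects the terms of every indexed field, grouped by field name.
    static Map<String, SortedSet<String>> collectTerms(IndexReader reader) throws IOException {
        Map<String, SortedSet<String>> termsByField = new TreeMap<>();
        for (LeafReaderContext lrc : reader.leaves()) {
            for (FieldInfo fieldInfo : lrc.reader().getFieldInfos()) {
                if (fieldInfo.getIndexOptions() == IndexOptions.NONE) {
                    continue; // field is not indexed, so it has no terms
                }
                Terms terms = lrc.reader().terms(fieldInfo.name);
                if (terms == null) {
                    continue; // no terms for this field in this segment
                }
                SortedSet<String> fieldTerms =
                        termsByField.computeIfAbsent(fieldInfo.name, k -> new TreeSet<>());
                TermsEnum termsEnum = terms.iterator();
                BytesRef term;
                while ((term = termsEnum.next()) != null) {
                    fieldTerms.add(term.utf8ToString());
                }
            }
        }
        return termsByField;
    }

    // Builds a tiny in-memory index so the sketch can be run on its own.
    static Directory buildDemoIndex() throws IOException {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("title", "Lucene index", Field.Store.NO));
            doc.add(new TextField("body", "hello world", Field.Store.NO));
            writer.addDocument(doc);
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        try (Directory dir = buildDemoIndex();
             IndexReader reader = DirectoryReader.open(dir)) {
            collectTerms(reader).forEach((field, terms) -> System.out.println(field + ": " + terms));
        }
    }
}
```

Holding every term in memory is only reasonable for small indexes. For a large index it is probably better to process each term inside the loop, as the per-field method does. Note also that terms.utf8ToString() only makes sense for text terms.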

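Final Note

I guess you can also access terms on a per-document basis, instead of a per-field basis, as mentioned in the comments. I have not tried this.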
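One possible way to do this is through term vectors. The sketch below assumes the field was indexed with term vectors enabled and uses IndexReader.getTermVectors(int) as it exists in the 8.x line. The class name DocTermsSketch, the methods termsForDocument and buildDemoIndex, and the field names are hypothetical. Fields indexed without term vectors do not show up in the result at all, so this is not a general replacement for the per-field approach.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, not part of Lucene.
final class DocTermsSketch {

    // Returns the terms of one document, grouped by field, for fields that stored term vectors.
    static Map<String, List<String>> termsForDocument(IndexReader reader, int docId) throws IOException {
        Map<String, List<String>> result = new TreeMap<>();
        Fields vectors = reader.getTermVectors(docId);
        if (vectors == null) {
            return result; // the document stored no term vectors
        }
        for (String field : vectors) {
            Terms terms = vectors.terms(field);
            if (terms == null) {
                continue;
            }
            List<String> fieldTerms = new ArrayList<>();
            TermsEnum termsEnum = terms.iterator();
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                fieldTerms.add(term.utf8ToString());
            }
            result.put(field, fieldTerms);
        }
        return result;
    }

    // Builds a tiny in-memory index: "body" stores term vectors, "title" does not.
    static Directory buildDemoIndex() throws IOException {
        FieldType withVectors = new FieldType(TextField.TYPE_NOT_STORED);
        withVectors.setStoreTermVectors(true);

        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("title", "Lucene index", Field.Store.NO));
            doc.add(new Field("body", "hello world hello", withVectors));
            writer.addDocument(doc);
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        try (Directory dir = buildDemoIndex();
             IndexReader reader = DirectoryReader.open(dir)) {
            System.out.println(termsForDocument(reader, 0));
        }
    }
}
```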

Solution 1
2 · accepted · 2020-11-20 02:03:49
