How to get a list of all tokens from Lucene 8.6.1 index?

I have looked at "How to get a list of all tokens from Solr/Lucene index?", but Lucene 8.6.1 doesn't seem to offer IndexReader.terms(). Has it been moved or replaced? Is there an easier way than this answer?

Some History

You asked: I'm just wondering if IndexReader.terms() has moved or been replaced by an alternative.

The Lucene v3 method IndexReader.terms() was moved to AtomicReader in Lucene v4. This was documented in the v4 alpha release notes.

(Bear in mind that Lucene v4 was released way back in 2012.)

The method in AtomicReader in v4 takes a field name.

As the v4 release notes state:

One big difference is that field and terms are now enumerated separately: a TermsEnum provides a BytesRef (wraps a byte[]) per term within a single field, not a Term.

The key part there is "per term within a single field". So from that point onward there was no longer a single API call to retrieve all terms from an index.

This approach has carried through to later releases, except that the AtomicReader and AtomicReaderContext classes were renamed to LeafReader and LeafReaderContext in Lucene v5.0.0. See LUCENE-5569.

Recent Releases

That leaves us with the ability to access lists of terms, but only on a per-field basis.

The following code is based on the latest release of Lucene (8.7.0), but should also hold true for the version you mention (8.6.1). The example uses Java:

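import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Prints every term of the given field, one index segment (leaf) at a time.
private void getTokensForField(IndexReader reader, String fieldName) throws IOException {
    List<LeafReaderContext> list = reader.leaves();

    for (LeafReaderContext lrc : list) {
        Terms terms = lrc.reader().terms(fieldName);
        if (terms != null) { // null when this segment has no terms for the field
            TermsEnum termsEnum = terms.iterator();

            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                System.out.println(term.utf8ToString());
            }
        }
    }
}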

The above example assumes an IndexReader opened on an existing index, as follows:

private static final String INDEX_PATH = "/path/to/index/directory";
...
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
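
Because the loop in getTokensForField visits each index segment separately, a term that occurs in more than one segment is printed once per segment. The sketch below is not part of the original answer; it collects the terms into a sorted set instead, so each term appears once. The class name UniqueTermsSketch, the helper buildDemoIndex, the field name "body", and the in-memory ByteBuffersDirectory are illustrative assumptions, chosen so the example can run without an index on disk.

```java
import java.io.IOException;
import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

// Hypothetical demo class; not taken from the original answer.
public class UniqueTermsSketch {

    // Collects every distinct term of one field across all segments.
    static SortedSet<String> uniqueTerms(IndexReader reader, String fieldName) throws IOException {
        SortedSet<String> result = new TreeSet<>();
        for (LeafReaderContext lrc : reader.leaves()) {
            Terms terms = lrc.reader().terms(fieldName);
            if (terms == null) {
                continue; // this segment has no terms for the field
            }
            TermsEnum termsEnum = terms.iterator();
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                result.add(term.utf8ToString()); // the set drops duplicates
            }
        }
        return result;
    }

    // Builds a tiny in-memory index. The commit between the two documents
    // flushes the first one, so the index typically ends up with two segments.
    static Directory buildDemoIndex() throws IOException {
        Directory dir = new ByteBuffersDirectory();
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document first = new Document();
            first.add(new TextField("body", "apple banana", Field.Store.NO));
            writer.addDocument(first);
            writer.commit();

            Document second = new Document();
            second.add(new TextField("body", "banana cherry", Field.Store.NO));
            writer.addDocument(second);
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        try (Directory dir = buildDemoIndex();
             IndexReader reader = DirectoryReader.open(dir)) {
            System.out.println(uniqueTerms(reader, "body")); // prints [apple, banana, cherry]
        }
    }
}
```

Lucene 8 also has MultiTerms.getTerms(reader, fieldName), which may be a shorter route to a merged per-field view; the two approaches have not been compared here.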

If you need to enumerate field names, the code in this question may provide a starting point.
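
As a rough illustration of enumerating field names (a sketch, not the code from the linked question), each segment's FieldInfos can be walked and the names of indexed fields collected. The class name IndexedFieldNamesSketch and the method name indexedFieldNames are hypothetical; the sketch assumes the Lucene 8.x API.

```java
import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

// Hypothetical helper class; not taken from the original answer.
public class IndexedFieldNamesSketch {

    // Returns the names of all fields that are indexed and can therefore have terms.
    static SortedSet<String> indexedFieldNames(IndexReader reader) {
        SortedSet<String> names = new TreeSet<>();
        for (LeafReaderContext lrc : reader.leaves()) {
            for (FieldInfo info : lrc.reader().getFieldInfos()) {
                // Fields that are only stored (or only hold doc values or points)
                // report IndexOptions.NONE and have no term dictionary.
                if (info.getIndexOptions() != IndexOptions.NONE) {
                    names.add(info.name);
                }
            }
        }
        return names;
    }
}
```

Each returned name can then be passed as the fieldName argument of a per-field method such as getTokensForField.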

Final Note

I guess you can also access terms on a per-document basis, instead of a per-field basis, as mentioned in the comments. I have not tried this.
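
One way the per-document route might look in Lucene 8.x is through term vectors. This is an untested-in-production sketch under a significant assumption: the field must have been indexed with term vectors enabled (for example via FieldType.setStoreTermVectors(true)), otherwise nothing is returned. The class name DocumentTermsSketch and the method name termsForDocument are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class; not taken from the original answer.
public class DocumentTermsSketch {

    // Lists the distinct terms of one field in one document, read from its term vector.
    static List<String> termsForDocument(IndexReader reader, int docId, String fieldName)
            throws IOException {
        List<String> result = new ArrayList<>();
        Terms vector = reader.getTermVector(docId, fieldName);
        if (vector == null) {
            return result; // no term vector was stored for this document and field
        }
        TermsEnum termsEnum = vector.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            result.add(term.utf8ToString());
        }
        return result;
    }
}
```

Here docId is Lucene's internal document number (0 up to reader.maxDoc() - 1), not an application-level identifier.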
