
How to get the postings list for each term in a Lucene index

I am reading a Lucene index and can retrieve its terms. I want to get the postings list for each term in the index. I am using the Lucene 7.4.0 jar. Each document in this index has two fields: (1) text_es, text_fr, or text_en, and (2) DocId. Below is the code.

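import java.io.IOException;
import java.nio.file.Paths;
import java.util.LinkedList;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class LuceneTest {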

public static void main(String[] args) {
    final String INDEX_DIRECTORY = "./index";
    Directory index;
        try {

            index = FSDirectory.open(Paths.get(INDEX_DIRECTORY));
            IndexReader indexReader = DirectoryReader.open(index);

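            // Assumes the index has at least three segments and that segments 0, 1 and 2
            // hold the Spanish, French and English documents respectively.
            LeafReaderContext leafReaderContext_es = indexReader.leaves().get(0);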
            LeafReaderContext leafReaderContext_fr = indexReader.leaves().get(1);
            LeafReaderContext leafReaderContext_en = indexReader.leaves().get(2);

            LinkedList<String> terms_es = new LinkedList<>();
            LinkedList<String> terms_en = new LinkedList<>();
            LinkedList<String> terms_fr = new LinkedList<>();

            LeafReader ir_es = leafReaderContext_es.reader();
            LeafReader ir_fr = leafReaderContext_fr.reader();
            LeafReader ir_en = leafReaderContext_en.reader();

            TermsEnum terms = ir_es.terms("text_es").iterator();
            BytesRef next = terms.next();
            while (next != null){
                terms_es.add(terms.term().utf8ToString());
                next = terms.next();
            }

            TermsEnum termsEnum_fr = ir_fr.terms("text_fr").iterator();
            BytesRef next_fr = termsEnum_fr.next();
            while (next_fr != null){
                terms_fr.add(termsEnum_fr.term().utf8ToString());
                next_fr = termsEnum_fr.next();
            }

            TermsEnum termsEnum_en = ir_en.terms("text_en").iterator();
            BytesRef next_en = termsEnum_en.next();
            while (next_en != null){
                terms_en.add(termsEnum_en.term().utf8ToString());
                next_en = termsEnum_en.next();
            }

            System.out.println("Spanish terms are as follows:");
            System.out.println(terms_es);

            System.out.println("French terms are as follows:");
            System.out.println(terms_fr);

            System.out.println("English terms are as follows:");
            System.out.println(terms_en);


        } catch (IOException e) {
            e.printStackTrace();
        }
}
}

I went through the Lucene 7.4.0 documentation and came across the method postings(Term term), which returns a PostingsEnum for the specified term with PostingsEnum.FREQS. The problem is that this method takes a parameter of class Term, but what I have is a TermsEnum. How can I convert it to a Term so that I can use postings to retrieve the postings list for each term?

Thanks.

I use Lucene 8.2; you could try the code below:

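    // Uses org.apache.lucene.index.{DirectoryReader, IndexReader, PostingsEnum, Terms, TermsEnum}
    // and org.apache.lucene.search.DocIdSetIterator; indexDir is an already opened Directory.
    IndexReader indexReader = DirectoryReader.open(indexDir);
    // Term vector of document 0 for the "content" field; null unless term vectors were stored.
    Terms termVector = indexReader.getTermVector(0, "content");
    if (termVector != null) {
        TermsEnum termIter = termVector.iterator();
        while (termIter.next() != null) {
            PostingsEnum postingsEnum = termIter.postings(null, PostingsEnum.ALL);
            while (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                int freq = postingsEnum.freq();
                System.out.printf("term: %s, freq: %d,", termIter.term().utf8ToString(), freq);
                // Read each of the freq occurrences; offsets are -1 when they were not stored.
                while (freq > 0) {
                    System.out.printf(" nextPosition: %d,", postingsEnum.nextPosition());
                    System.out.printf(" startOffset: %d, endOffset: %d",
                            postingsEnum.startOffset(), postingsEnum.endOffset());
                    freq--;
                }
                System.out.println();
            }
        }
    }
    indexReader.close();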
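
The snippet above reads the term vector of a single document. If what you want is the postings list of every term across the whole index, as in the question, a minimal sketch along these lines may be closer. It walks each segment's terms dictionary and asks the TermsEnum itself for the postings, so no Term object is needed. The class name PostingsDump and the method collectPostings are made up for illustration; "./index" and "text_en" are taken from the question. It is written against the 7.4-style API, and it does not filter out deleted documents.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, not part of Lucene.
class PostingsDump {

    /** Maps each term of the given field to its postings as {docId, freq} pairs. */
    static Map<String, List<int[]>> collectPostings(IndexReader reader, String field)
            throws IOException {
        Map<String, List<int[]>> postingsByTerm = new TreeMap<>();
        for (LeafReaderContext ctx : reader.leaves()) {
            Terms terms = ctx.reader().terms(field);
            if (terms == null) {
                continue; // this segment has no postings for the field
            }
            TermsEnum termsEnum = terms.iterator();
            PostingsEnum postings = null;
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                // The enum is already positioned on the term, so ask it directly.
                postings = termsEnum.postings(postings, PostingsEnum.FREQS);
                List<int[]> list = postingsByTerm.computeIfAbsent(
                        term.utf8ToString(), k -> new ArrayList<>());
                int doc;
                while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    // doc is segment-local; adding docBase gives the index-wide ID.
                    list.add(new int[] {ctx.docBase + doc, postings.freq()});
                }
            }
        }
        return postingsByTerm;
    }

    public static void main(String[] args) throws IOException {
        try (FSDirectory dir = FSDirectory.open(Paths.get("./index"));
             IndexReader reader = DirectoryReader.open(dir)) {
            collectPostings(reader, "text_en").forEach((term, list) -> {
                StringBuilder sb = new StringBuilder(term).append(" ->");
                for (int[] p : list) {
                    sb.append(" (doc=").append(p[0]).append(", freq=").append(p[1]).append(')');
                }
                System.out.println(sb);
            });
        }
    }
}
```

If you do need a Term, for example to call LeafReader.postings(Term), you can build one from the enum's current value with new Term(field, BytesRef.deepCopyOf(termsEnum.term())). The copy matters because the BytesRef returned by the enum may be reused on the next call to next().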
