How do I read and print a Lucene 4.0 index?

Question

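I want to read an index from the files written by the indexer.

The result I'm after is, for each document, all of its terms together with their TF-IDF values.

Please suggest some sample code. Thanks :)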

Answer 1

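The first thing is to get a list of documents. An alternative might be to iterate over the indexed terms, but the method IndexReader.terms() appears to have been removed in 4.0 (though it exists in AtomicReader, which may be worth a look). The best way I know of to get all documents is simply to loop over them by document id: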

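//where reader is your IndexReader, however you go about opening/managing it
//IndexReader.isDeleted(int) no longer exists in 4.0; check the live-docs bitset instead
//(needs org.apache.lucene.index.MultiFields and org.apache.lucene.util.Bits)
Bits liveDocs = MultiFields.getLiveDocs(reader); //null when the index has no deletions
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    //operate on the document with id = i ...
}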
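To make the difference between maxDoc and the number of live documents concrete, here is a small sketch in plain Java that does not touch Lucene at all. java.util.BitSet stands in for Lucene's Bits, and the class and method names (LiveDocsSketch, liveDocIds) are made up for illustration. The only thing carried over from Lucene is the convention that a null bitset means nothing has been deleted.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class LiveDocsSketch {
    // Returns the ids in [0, maxDoc) that are still live.
    // A null bitset is treated as "no deletions", mirroring the Lucene 4.x convention.
    public static List<Integer> liveDocIds(BitSet liveDocs, int maxDoc) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < maxDoc; i++) {
            if (liveDocs != null && !liveDocs.get(i)) {
                continue; // deleted document: the id is still below maxDoc but must be skipped
            }
            ids.add(i);
        }
        return ids;
    }

    public static void main(String[] args) {
        int maxDoc = 4;                // hypothetical index with four document ids
        BitSet live = new BitSet(maxDoc);
        live.set(0);
        live.set(1);
        live.set(3);                   // id 2 has been deleted
        List<Integer> ids = liveDocIds(live, maxDoc);
        System.out.println("maxDoc=" + maxDoc + " numDocs=" + ids.size() + " live=" + ids);
    }
}
```

The count of ids that survive the check is what numDocs() reports, which is why it can be smaller than maxDoc().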

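Then you need a list of all the indexed terms. I'm assuming we have no interest in stored fields, since the data you want doesn't make sense for them. To retrieve the terms you can use IndexReader.getTermVectors(int), which only returns something for fields that were indexed with term vectors enabled. Note that I'm not actually retrieving the document, since we don't need to access it directly. Continuing from where we left off: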

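//Needs org.apache.lucene.index.{Fields, FieldsEnum, MultiFields, Term, TermsEnum},
//org.apache.lucene.search.similarities.DefaultSimilarity and org.apache.lucene.util.{Bits, BytesRef}
String field;
BytesRef term;
FieldsEnum fieldsiterator;
TermsEnum termsiterator;
//To simplify, you can rely on DefaultSimilarity to calculate tf and idf for you.
DefaultSimilarity freqcalculator = new DefaultSimilarity();
//numDocs and maxDoc are not the same thing:
//numDocs counts live documents only, maxDoc also covers the ids of deleted ones.
int numDocs = reader.numDocs();
int maxDoc = reader.maxDoc();
//Stands in for the isDeleted(i) check, which no longer exists in 4.0; null means no deletions.
Bits liveDocs = MultiFields.getLiveDocs(reader);

for (int i = 0; i < maxDoc; i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    Fields vectors = reader.getTermVectors(i);
    if (vectors == null)
        continue; //no term vectors were stored for this document
    //FieldsEnum is the API this answer was written against. If your 4.x release has no
    //FieldsEnum, iterate the Fields object directly (for (String f : vectors)) and use vectors.terms(f).
    fieldsiterator = vectors.iterator();
    while ((field = fieldsiterator.next()) != null) {
        termsiterator = fieldsiterator.terms().iterator(null);
        while ((term = termsiterator.next()) != null) {
            //i = document id, field = field name
            //String representation of the current term
            String termtext = term.utf8ToString();
            //I haven't tested this.
            //On a term vector, totalTermFreq() is the number of occurrences within this document.
            float tf = freqcalculator.tf((float) termsiterator.totalTermFreq());
            //docFreq has to come from the reader: the term-vector enum only sees this one document.
            float idf = freqcalculator.idf(reader.docFreq(new Term(field, term)), numDocs);
            System.out.println(i + "\t" + field + "\t" + termtext + "\t" + (tf * idf));
        }
    }
}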
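If you want to sanity-check the numbers without an index at hand, the arithmetic behind DefaultSimilarity is easy to reproduce: tf is the square root of the in-document frequency, and idf is 1 + ln(numDocs / (docFreq + 1)). The sketch below applies those two formulas to a tiny in-memory corpus. The class name TfIdfSketch, the helper methods and the sample documents are all made up for illustration, and the product tf * idf is just one common way to combine the two factors, not the full Lucene scoring formula.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {
    // Same formula as DefaultSimilarity.tf: square root of the in-document frequency.
    public static float tf(long freq) {
        return (float) Math.sqrt(freq);
    }

    // Same formula as DefaultSimilarity.idf: 1 + ln(numDocs / (docFreq + 1)).
    public static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Counts how often each term occurs in one tokenized document (insertion order kept).
    public static Map<String, Integer> termFreqs(List<String> tokens) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : tokens) {
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    // Number of documents in the corpus that contain the term at least once.
    public static int docFreq(List<List<String>> corpus, String term) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical, already-tokenized documents standing in for one indexed field.
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("lucene", "index", "reader"),
                Arrays.asList("lucene", "term", "vector", "term"),
                Arrays.asList("index", "term"));
        for (int id = 0; id < corpus.size(); id++) {
            for (Map.Entry<String, Integer> e : termFreqs(corpus.get(id)).entrySet()) {
                float weight = tf(e.getValue()) * idf(docFreq(corpus, e.getKey()), corpus.size());
                System.out.printf("doc=%d term=%s freq=%d tfidf=%.4f%n",
                        id, e.getKey(), e.getValue(), weight);
            }
        }
    }
}
```

Comparing this output with what the Lucene loop prints for the same small set of documents is a quick way to confirm that the per-document frequency and the index-wide docFreq are being read from the right places.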

Answer 1 -1 2013-01-08 17:25:02
