Lucene performance: transferring field data from one index to another

Question

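In short, I need to invert the mapping between multiple fields and values while copying from one index to a resulting index.

Here is the scenario.

Index 1 structure [field => value] [stored]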

Doc 1    
keys => keyword1;    
Ids => id1, id1, id2, id3, id7, id11, etc.. 

Doc 2    
keys => keyword2;    
Ids => id3, id11, etc..

Index 2 structure [field => value] [stored]

Doc 1    
ids => id1    
keys => keyword1, keyword1

Doc 3    
ids => id3    
keys => keyword1, keyword2, etc..

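Note that the keys <-> ids mapping is reversed in the resulting index.

In terms of time complexity, what do you think is the most efficient way to do this?

The only way I can think of is: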

1) index1Reader.terms();    
2) Process only terms belonging to "Ids" field    
3) For each term, get TermDocs    
4) For each doc, load it, get "keys" field info    
5) Create a new Lucene Doc, add 'Id', multi Keys, write it to index2.     
6) Go to step 2.
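The inversion these steps aim for can be sketched without Lucene at all. The block below is only an in-memory illustration, assuming the whole mapping fits in plain Java collections (which a 6 GB index may not). The class name `MappingInverter` and the method `invert` are hypothetical, not part of Lucene or of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of inverting keyword -> ids into id -> keywords (no Lucene involved). */
final class MappingInverter {

    /**
     * Walks every (keyword, id) pair once and files the keyword under its id.
     * Duplicate ids under one keyword are kept, mirroring the sample where
     * "id1, id1" under keyword1 becomes "keyword1, keyword1" under id1.
     */
    static Map<String, List<String>> invert(Map<String, List<String>> keyToIds) {
        Map<String, List<String>> idToKeys = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : keyToIds.entrySet()) {
            for (String id : entry.getValue()) {
                idToKeys.computeIfAbsent(id, unused -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return idToKeys;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index1 = new TreeMap<>();
        index1.put("keyword1", Arrays.asList("id1", "id1", "id2", "id3"));
        index1.put("keyword2", Arrays.asList("id3", "id11"));
        // Print one "id => keys" line per resulting document
        invert(index1).forEach((id, keys) -> System.out.println(id + " => " + String.join(", ", keys)));
    }
}
```

The total work is proportional to the number of (keyword, id) pairs. The term-by-term approach listed above visits the same pairs, so most of its extra cost is likely the per-document load in step 4.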

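Since the fields are stored, I'm sure there are multiple ways to do this.

Please guide me with any performance techniques. Considering that Index1 is about 6 GB, even the slightest improvement will have a huge impact in my scenario.

Total no. of unique keywords: 18 million; total no. of ids: 0.9 million

Interesting updates

Optimization 1

When adding a new document, instead of creating multiple duplicate Field objects, building a single StringBuffer with a " " (space) delimiter and then adding the whole thing as a single field seems to give an improvement of about 25%.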
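A minimal sketch of that joining step, in plain Java with no Lucene types. `KeyJoiner` and `joinKeys` are hypothetical names. The sketch assumes that no key contains a space and that the combined field is later indexed with an analyzer that splits on whitespace. Otherwise the individual keys would not remain separately searchable.

```java
import java.util.List;

/** Builds one space-delimited value from many keys, so a document needs only one field object. */
final class KeyJoiner {

    /** Joins keys with a single space and no trailing delimiter; an empty list gives "". */
    static String joinKeys(List<String> keys) {
        StringBuilder sb = new StringBuilder();
        for (String key : keys) {
            if (sb.length() > 0) {
                sb.append(' '); // the delimiter goes between keys only
            }
            sb.append(key);
        }
        return sb.toString();
    }
}
```

The resulting string would then be passed to a single field constructor, in place of one field per key.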

Update 2: Code

public void go() throws IOException, ParseException {
    String id = null;
    int counter = 0;
    while ((id = getNextId()) != null) { // this method is not taking time..
        System.out.println("Node id: " + id);
        updateIndex2DataForId(id);
        if(++counter > 10){
            break;
        }
    }
    index2Writer.close();
}

private void updateIndex2DataForId(String id) throws ParseException, IOException {
    // Get all terms containing the node id
    TermDocs termDocs = index1Reader.termDocs(new Term("id", id));
    // Iterate
    Document doc = new Document();
    doc.add(new Field("id", id, Store.YES, Index.NOT_ANALYZED));
    int docId = -1;        
    while (termDocs.next()) {
        docId = termDocs.doc();
        doc.add(getKeyDataAsField(docId, Store.YES, Index.NOT_ANALYZED));            
    }
    index2Writer.addDocument(doc);
}

private Field getKeyDataAsField(int docId, Store storeOption, Index indexOption) throws CorruptIndexException,
        IOException {
    Document doc = index1Reader.document(docId, fieldSelector); // fieldSel has "key"
    Field f = new Field("key", doc.get("key"), storeOption, indexOption);
    return f;
}

Answer 1

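Using FieldCache worked like a charm... but we need to allocate more and more RAM to hold all the field values on the heap.

I have updated the updateIndex2DataForId() method above with the following snippet.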

private void updateIndex2DataForId(String id) throws ParseException, IOException {
    // Get all docs in index 1 whose "id" field contains this node id
    TermDocs termDocs = index1Reader.termDocs(new Term("id", id));
    Document doc = new Document();
    doc.add(new Field("id", id, Store.YES, Index.NOT_ANALYZED));
    StringBuffer buffer = new StringBuffer();
    while (termDocs.next()) {
        // keys[] is pre-populated using FieldCache (declared below) and indexed by doc id
        buffer.append(keys[termDocs.doc()]).append(' ');
    }
    termDocs.close();
    // StringBuffer has no trim(), so convert to String first and then drop the trailing space.
    // The original snippet named this field "id"; "keys" matches the Index 2 structure
    // and is assumed to be the intended name.
    doc.add(new Field("keys", buffer.toString().trim(), Store.YES, Index.ANALYZED));
    index2Writer.addDocument(doc);
}

String[] keys = FieldCache.DEFAULT.getStrings(index1Reader, "keywords");

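It made everything faster. I can't give you exact metrics, but I must say the improvement was very significant.

The program now completes in a reasonable amount of time. In any case, further guidance is highly appreciated.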

Solution 1
0 · accepted · 2012-08-01 05:12:23
