
Lucene IndexWriter slow to add documents

I wrote a small loop which added 10,000 documents into the IndexWriter and it took forever to do it.

Is there another way to index large volumes of documents?

I ask because when this goes live it has to load in 15,000 records.

The other question is: how do I prevent having to load in all the records again when the web application is restarted?

Edit

Here is the code I used:

for (int t = 0; t < 10000; t++){
    doc = new Document();
    text = "Value" + t.ToString();
    doc.Add(new Field("Value", text, Field.Store.YES, Field.Index.TOKENIZED));
    iwriter.AddDocument(doc);
}

Edit 2

        Analyzer analyzer = new StandardAnalyzer();
        Directory directory = new RAMDirectory();

        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);

        iwriter.SetMaxFieldLength(25000);

then the code to add the documents, followed by:

        iwriter.Close();

You should do it this way to get the best performance. On my machine I'm indexing 1,000 documents in 1 second.

1) You should reuse the Document and Field instances instead of creating new ones every time you add a document, like this:

private static void IndexingThread(object contextObj)
{
     Range<int> range = (Range<int>)contextObj;

     // Create the Document and its Fields once, outside the loop.
     Document newDoc = new Document();
     newDoc.Add(new Field("title", "", Field.Store.NO, Field.Index.ANALYZED));
     newDoc.Add(new Field("body", "", Field.Store.NO, Field.Index.ANALYZED));
     newDoc.Add(new Field("newsdate", "", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
     newDoc.Add(new Field("id", "", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));

     for (int counter = range.Start; counter <= range.End; counter++)
     {
         // Only the field values change between documents.
         newDoc.GetField("title").SetValue(Entities[counter].Title);
         newDoc.GetField("body").SetValue(Entities[counter].Body);
         newDoc.GetField("newsdate").SetValue(Entities[counter].NewsDate);
         newDoc.GetField("id").SetValue(Entities[counter].ID.ToString());

         writer.AddDocument(newDoc);
     }
}

After that you can use threading: break your large collection into smaller ones and use the above code for each section. For example, if you have 10,000 documents you can create 10 threads using ThreadPool and feed each section to one thread for indexing.

Then you will get the best performance.
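The partitioning described above can be sketched roughly like this. Note this is an illustrative sketch, not the answerer's actual code: `Split` is a hypothetical helper standing in for however the `Range<int>` values were built, and the real indexing body (the `IndexingThread` logic) is elided as a comment.

```csharp
using System;
using System.Threading;

public static class IndexPartitioner
{
    // Divide [0, total - 1] into `parts` contiguous (start, end) ranges.
    // Hypothetical helper, not part of Lucene.
    public static Tuple<int, int>[] Split(int total, int parts)
    {
        var ranges = new Tuple<int, int>[parts];
        int size = total / parts;
        for (int i = 0; i < parts; i++)
        {
            int start = i * size;
            // The last range absorbs any remainder.
            int end = (i == parts - 1) ? total - 1 : start + size - 1;
            ranges[i] = Tuple.Create(start, end);
        }
        return ranges;
    }

    public static void IndexInParallel(int totalDocs, int threadCount)
    {
        using (var done = new CountdownEvent(threadCount))
        {
            foreach (var range in Split(totalDocs, threadCount))
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    var r = (Tuple<int, int>)state;
                    // The IndexingThread body from above would run here,
                    // each worker reusing its own Document/Field instances.
                    // Lucene's IndexWriter is thread-safe, so all workers
                    // can share the same writer.
                    done.Signal();
                }, range);
            }
            done.Wait();
        }
    }
}
```

So `IndexPartitioner.Split(10000, 10)` yields ten ranges of 1,000 documents each, and each one is queued as a separate ThreadPool work item.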

Just checking, but you haven't got the debugger attached when you're running it, have you?

This severely affects performance when adding documents.

On my machine (Lucene 2.0.0.4):

Built with platform target x86:

  • No debugger - 5.2 seconds

  • Debugger attached - 113.8 seconds

Built with platform target x64:

  • No debugger - 6.0 seconds

  • Debugger attached - 171.4 seconds

Rough example of saving and loading an index to and from a RAMDirectory:

const int DocumentCount = 10 * 1000;
const string IndexFilePath = @"X:\Temp\tmp.idx";

Analyzer analyzer = new StandardAnalyzer();
Directory ramDirectory = new RAMDirectory();

IndexWriter indexWriter = new IndexWriter(ramDirectory, analyzer, true);

for (int i = 0; i < DocumentCount; i++)
{
    Document doc = new Document();
    string text = "Value" + i;
    doc.Add(new Field("Value", text, Field.Store.YES, Field.Index.TOKENIZED));
    indexWriter.AddDocument(doc);
}

indexWriter.Close();

//Save index
FSDirectory fileDirectory = FSDirectory.GetDirectory(IndexFilePath, true);
IndexWriter fileIndexWriter = new IndexWriter(fileDirectory, analyzer, true);
fileIndexWriter.AddIndexes(new[] { ramDirectory });
fileIndexWriter.Close();

//Load index
FSDirectory newFileDirectory = FSDirectory.GetDirectory(IndexFilePath, false);
Directory newRamDirectory = new RAMDirectory();
IndexWriter newIndexWriter = new IndexWriter(newRamDirectory, analyzer, true);
newIndexWriter.AddIndexes(new[] { newFileDirectory });

Console.WriteLine("New index writer document count:{0}.", newIndexWriter.DocCount());
newIndexWriter.Close();
