
Lucene Indexing Performance

Original post 2016-10-06 06:47:15 2 1 java/ performance/ lucene

I have written a program that indexes database data to disk, and I am not sure whether my indexing speed is reasonable, i.e. whether it is very slow and whether it can be improved further.

I get around 15,000 documents per hour, which amounts to around 2,600 KB of index directory size when creating new indices.

I am using Lucene 6.0.0 on Windows 8.1 64-bit, with 16 GB RAM and an 8-core Intel Core i7. I am indexing on my local machine and am not sure what kind of disk I have; it is the usual one that comes with a Windows PC.

I am using Spring Batch to INNER JOIN two database tables and get a row-mapped object from the ItemReader; I then prepare a Document from this object.

I always call writer.updateDocument(contentDuplicateKeyTerm, doc) rather than addDocument(doc), since in Lucene 6.0.0 updateDocument adds the document to the index if it doesn't already exist, in addition to updating an existing document.
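The update-or-add behaviour described above can be modelled in a few lines. This is a sketch, not Lucene code: `UpsertSketch` and `Doc` are hypothetical stand-ins that only mimic the documented contract of `IndexWriter.updateDocument` (delete any documents matching the term, then add the new one), so it runs without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model of an index; NOT the Lucene API. */
final class UpsertSketch {
    /** Hypothetical document: a duplicate-detection key plus a body. */
    static final class Doc {
        final String key;
        final String body;

        Doc(String key, String body) {
            this.key = key;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Models addDocument: always appends, so repeated keys create duplicates. */
    void addDocument(Doc doc) {
        docs.add(doc);
    }

    /** Models updateDocument: drop every document with this key, then add the new one. */
    void updateDocument(String key, Doc doc) {
        docs.removeIf(d -> d.key.equals(key));
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UpsertSketch index = new UpsertSketch();
        index.updateDocument("42", new Doc("42", "first version"));  // key absent: behaves like an add
        index.updateDocument("42", new Doc("42", "second version")); // key present: replaces the old copy
        System.out.println(index.size());
        index.addDocument(new Doc("42", "third version"));           // plain add: duplicate key is kept
        System.out.println(index.size());
    }
}
```

Running `main` prints 1 and then 2: the two updates leave a single document for the key, while the plain add creates a second copy.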

I am not aware of any benchmark to compare my program against.

Please suggest.

EDIT: I am now able to achieve around 180,000 documents per hour. The issue was calling IndexWriter.commit() after updating each Document; I now commit at regular intervals, and that has improved performance greatly.
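A minimal sketch of committing at intervals instead of per document follows. To keep it runnable without Lucene, `IndexSink`, `IntervalCommitter` and `CountingSink` are hypothetical names introduced for illustration; with a real `IndexWriter` the two marked calls would be `writer.updateDocument(term, doc)` and `writer.commit()`. The interval of 1000 in the usage below is an arbitrary assumption, not a value from the post.

```java
/** Hypothetical stand-in for the two IndexWriter calls used here. */
interface IndexSink {
    void update(String key, String doc); // with Lucene: writer.updateDocument(term, doc)
    void commit();                       // with Lucene: writer.commit()
}

/** Commits once every `interval` updates instead of after every update. */
final class IntervalCommitter {
    private final IndexSink sink;
    private final int interval;
    private int pending = 0;

    IntervalCommitter(IndexSink sink, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.sink = sink;
        this.interval = interval;
    }

    void index(String key, String doc) {
        sink.update(key, doc);
        pending++;
        if (pending >= interval) {
            sink.commit();
            pending = 0;
        }
    }

    /** Commits whatever is left over after the last full interval. */
    void finish() {
        if (pending > 0) {
            sink.commit();
            pending = 0;
        }
    }
}

/** Test double that only counts calls. */
final class CountingSink implements IndexSink {
    int updates = 0;
    int commits = 0;

    @Override
    public void update(String key, String doc) {
        updates++;
    }

    @Override
    public void commit() {
        commits++;
    }
}
```

For example, feeding 2500 documents through `new IntervalCommitter(sink, 1000)` and then calling `finish()` results in 3 commits rather than 2500. With Spring Batch, the chunk size plays the role of the interval if the commit is issued once per chunk.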

1 answer

I was making several mistakes, which is why write performance was slow. The mistakes and their fixes were:

1. I was committing after each document. Since I am using Spring Batch, I changed the program to commit after each chunk. Increasing the commit interval improved performance significantly.
2. I was closing and reopening writer instances unnecessarily (the logic was initially designed that way). I changed it to keep a single writer instance in application scope and reuse it as needed.
3. The source data came from a DB2 database, and reading from the tables was slow. I added database indexes to improve read performance.
4. The Lucene IndexWriter is thread-safe, so I started writing from multiple threads instead of a single thread.
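The combination of a single long-lived writer, multi-threaded writes, and one commit per chunk can be sketched as below. This is an illustration under stated assumptions, not the author's code: `SharedWriterSketch` and `SharedSink` are hypothetical, and `SharedSink` is a thread-safe stand-in for one shared `IndexWriter` so the example runs without Lucene. The thread count and chunk layout are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedWriterSketch {
    /** Hypothetical thread-safe stand-in for a single shared IndexWriter. */
    static final class SharedSink {
        final ConcurrentHashMap<String, String> docs = new ConcurrentHashMap<>();
        final AtomicInteger commits = new AtomicInteger();

        void update(String key, String doc) { // with Lucene: writer.updateDocument(term, doc)
            docs.put(key, doc);
        }

        void commit() {                       // with Lucene: writer.commit()
            commits.incrementAndGet();
        }
    }

    /** Indexes each chunk on a worker thread; every worker reuses the same sink. */
    static void indexInParallel(SharedSink sink, List<List<String>> chunks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<String> chunk : chunks) {
                futures.add(pool.submit(() -> {
                    for (String key : chunk) {
                        sink.update(key, "doc-" + key);
                    }
                    sink.commit(); // one commit per chunk, not per document
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface worker failures
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("indexing task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The sink is created once and passed to every task, mirroring the single application-scoped writer; nothing is closed or reopened between chunks.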

So after increasing the writer's commit interval, indexing itself doesn't take much time, provided I have enough memory to hold large sets of documents. Reading and preparing documents doesn't take much time either. Lucene can index a few million documents in just a couple of minutes on modern machines.




solution1 1 2017-01-12 04:14:43
