
Lucene Indexing Performance

I have written a program that indexes database data to disk, and I am not sure whether my indexing speed is reasonable, i.e. whether it is unusually slow and whether it can be improved further.

I get around 15,000 documents per hour, which produces roughly 2,600 KB of index directory size when creating new indices.

I am using Lucene 6.0.0 on Windows 8.1 64-bit, with 16 GB RAM and an 8-core Intel Core i7. I am indexing on my local machine and am not sure what kind of disk I have; it is the usual one that ships with a Windows PC.

I am using Spring Batch to INNER JOIN two database tables and get a row-mapped object from the ItemReader, then I build a Document from this object.

I always call writer.updateDocument(contentDuplicateKeyTerm, doc) rather than addDocument(doc), since in Lucene 6.0.0 updateDocument deletes any existing documents that match the term and then adds the new one, so it also adds the document when nothing matches yet.

I am not aware of any benchmark to compare my program against.

Please suggest.

EDIT: I am now able to achieve around 180,000 documents per hour. The issue was calling IndexWriter.commit() after updating each Document; I now commit at regular intervals, and that has improved performance greatly.

I was making multiple mistakes, and that is why write performance was slow. Some of the mistakes and their fixes were:

  1. I was committing after each document. Since I am using Spring Batch, I changed the program to commit after each chunk. Increasing the commit interval improved performance significantly.

  2. I was closing and reopening writer instances unnecessarily (the logic was initially designed that way). I changed it to keep a single writer instance in application scope and reuse it as needed.

  3. The source data came from a DB2 database, and reading from the tables was slow. I added database indexes to improve read performance.

  4. The Lucene IndexWriter is thread-safe, so I started writing from multiple threads instead of a single thread.
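
The fixes above can be sketched roughly as follows. To keep the example runnable without Lucene on the classpath, it uses a hypothetical IndexSink interface as a stand-in for the two IndexWriter calls discussed here; IndexSink, CountingSink and ChunkedIndexer are illustrative names, not part of Lucene or Spring Batch. With a real writer, the two methods would map to writer.updateDocument(term, doc) and writer.commit().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the two IndexWriter calls used in this post. */
interface IndexSink {
    void updateDocument(String key, String doc); // Lucene: writer.updateDocument(new Term(field, key), doc)
    void commit();                               // Lucene: writer.commit()
}

/** In-memory sink that counts commits, so the effect of batching is visible. */
class CountingSink implements IndexSink {
    final Map<String, String> docs = new ConcurrentHashMap<>();
    final AtomicInteger commits = new AtomicInteger();

    @Override public void updateDocument(String key, String doc) { docs.put(key, doc); } // add or replace by key
    @Override public void commit() { commits.incrementAndGet(); }
}

class ChunkedIndexer {
    /** Writes all chunks through ONE shared sink from a thread pool, committing once per chunk. */
    static void indexAll(IndexSink sink, List<List<String[]>> chunks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (List<String[]> chunk : chunks) {
                pending.add(CompletableFuture.runAsync(() -> {
                    for (String[] row : chunk) {
                        sink.updateDocument(row[0], row[1]); // row = {key, content}
                    }
                    sink.commit(); // once per chunk, not once per document
                }, pool));
            }
            for (CompletableFuture<Void> f : pending) {
                f.join(); // wait for every chunk to finish
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 10 chunks of 100 rows each, with unique keys.
        List<List<String[]>> chunks = new ArrayList<>();
        for (int c = 0; c < 10; c++) {
            List<String[]> chunk = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                chunk.add(new String[] {"doc-" + (c * 100 + i), "content " + i});
            }
            chunks.add(chunk);
        }
        CountingSink sink = new CountingSink();
        indexAll(sink, chunks, 4);
        System.out.println(sink.docs.size() + " documents, " + sink.commits.get() + " commits");
    }
}
```

Here 1,000 documents cost 10 commits instead of 1,000. One thing to keep in mind with a real shared IndexWriter: commit() flushes the pending changes of all threads, not only those of the calling thread, so the chunk boundaries are approximate when several threads write at once.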

So after increasing the Lucene writer's commit interval, indexing itself no longer takes much time, provided there is enough memory to hold large sets of documents. Reading and preparing the documents doesn't take much time either. On modern machines Lucene can index a few million documents in just a couple of minutes.
