
Solr indexing of a large data set

I have content that is about 50 TB in size, spread across roughly 250 million documents. The daily increment is not very large: maybe 10,000 documents of varying sizes, totaling under 50 MB. The current indexing effort is taking far too long and is guesstimated to complete in 100+ days!
So ... is this really that large a data set? To me, 50 TB of content (in this day and age) is not very large. Do you have content of this size? If so, how did you improve the time taken for one-time indexing? And how did you improve the time taken by real-time indexing?
If you can answer ... great. If you can point me in the right direction ... I appreciate that as well.

Thanks in advance.
rd

There are a number of factors to consider.

  1. Start with the client you use to index. Which client are you using: SolrJ, a framework that listens to a database (such as Oracle or HBase), or the REST API? This makes a difference; Solr is good at handling the load, but the client framework and the data preparation on the client side also need to be optimized. For example, with the HBase Indexer (which reads from HBase tables and writes to Solr), you can expect a few million documents to be indexed per hour or so, and at that rate 250 million should not take long to complete. A minimal SolrJ sketch follows this list.

  2. After the client, you enter the Solr environment. How many fields are you indexing per document? Do you have stored fields, or other overhead in your field types?

  3. Configuration parameters are the next lever: autoCommit based on record count or RAM size, softCommit (as mentioned in the comment above), the number of parallel indexing threads, and the hardware itself are all points to consider. An example commit configuration follows the sketch below.
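To make point 1 concrete, here is a minimal SolrJ bulk-indexing sketch using ConcurrentUpdateSolrClient (the SolrJ 7/8-era API), which buffers documents and sends them over several parallel connections. The URL, collection name, field names, batch size, queue size, and thread count are all placeholder assumptions to tune for your own setup:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/collection; queue size and thread count are tuning
        // knobs to experiment with, not recommendations.
        ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient.Builder(
                "http://localhost:8983/solr/mycollection")
                .withQueueSize(10_000)   // documents buffered before flushing
                .withThreadCount(8)      // parallel connections into Solr
                .build();

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {           // stand-in for your real data source
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_t", "document " + i);   // hypothetical field names
            batch.add(doc);

            if (batch.size() == 1_000) {                // send in batches, never one document at a time
                client.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }
        client.blockUntilFinished();   // drain the internal queue
        client.commit();               // a single hard commit at the end of the bulk load
        client.close();
    }
}
```

Batching and a single final commit matter more than any individual knob here; committing per document is one of the most common causes of multi-day index builds.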
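For point 3, this is roughly what the commit settings look like in solrconfig.xml; the thresholds below are illustrative starting points, not recommendations. During a one-time bulk load, openSearcher=false keeps hard commits cheap because no new searcher is opened on every flush:

```xml
<!-- solrconfig.xml: example values only; tune for your hardware and latency needs -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flushes segments to disk and truncates the transaction log -->
  <autoCommit>
    <maxDocs>100000</maxDocs>          <!-- commit after this many docs -->
    <maxTime>60000</maxTime>           <!-- or after this many milliseconds -->
    <openSearcher>false</openSearcher> <!-- keep bulk-load commits cheap -->
  </autoCommit>

  <!-- Soft commit: makes new documents searchable without a full flush;
       this is the knob for near-real-time visibility -->
  <autoSoftCommit>
    <maxTime>30000</maxTime>           <!-- visibility latency in milliseconds -->
  </autoSoftCommit>
</updateHandler>
```

For your small daily increment (about 50 MB), a soft commit every 30 seconds or so should give near-real-time visibility at little cost.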

You can find a comprehensive checklist here and can verify each item. Happy designing!
