Solr indexing performance

We are experiencing performance issues with Solr batch indexing. Our cluster is composed of 4 workers, each equipped with 32 cores and 256GB of RAM. YARN is configured to use 100 vCores and 785.05GB of memory. HDFS storage is managed by an EMC Isilon system connected through a 10Gb interface. The cluster runs CDH 5.8.0 with Solr 4.10.3 and is Kerberized.

With the current setup we can index about 25GB of compressed data per day and 500GB per month using MapReduce jobs. Some of these jobs run daily and take almost 12 hours to index 15GB of compressed data. In particular, the MorphlineMapper jobs last approximately 5 hours and the TreeMergeMapper jobs about 6 hours.
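To put the figures quoted above in perspective, here is a rough throughput calculation. It is only a sketch: it assumes 1GB = 1024MB and uses the approximate durations given above, so the results are order-of-magnitude estimates, not measurements.

```shell
#!/bin/sh
# Figures taken from the description above.
data_gb=15      # compressed input of one daily run
total_h=12      # approximate wall-clock time of the whole run
morphline_h=5   # approximate duration of the MorphlineMapper phase

# Print the throughput in MB/s for <gigabytes> processed in <hours>.
# Assumption: 1 GB is counted as 1024 MB.
throughput_mb_s() {
  awk -v gb="$1" -v h="$2" 'BEGIN { printf "%.2f", gb * 1024 / (h * 3600) }'
}

echo "whole run:       $(throughput_mb_s "$data_gb" "$total_h") MB/s"
echo "morphline phase: $(throughput_mb_s "$data_gb" "$morphline_h") MB/s"
```

This works out to roughly 0.36 MB/s of compressed input for the whole run and about 0.85 MB/s for the MorphlineMapper phase alone.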

Is this performance normal? Can you suggest any tweaks that could improve our indexing performance?

We are using the MapReduceIndexerTool and there are no network problems. We are reading compressed files from HDFS and decompressing them in our morphline. This is the way we run our script:

cmd_hdp=$(
HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf" hadoop --config /etc/hadoop/conf.cloudera.yarn \
jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
-D morphlineVariable.ZK_HOST=hostname1:2181/solr \
-D morphlineVariable.COLLECTION=my_collection \
-D mapreduce.map.memory.mb=8192 \
-D mapred.child.java.opts=-Xmx4096m \
-D mapreduce.reduce.java.opts=-Xmx4096m \
-D mapreduce.reduce.memory.mb=8192 \
--output-dir hdfs://isilonhostname:8020/tmp/my_tmp_dir \
--morphline-file morphlines/my_morphline.conf \
--log4j log4j.properties \
--go-live \
--collection my_collection \
--zk-host hostname1:2181/solr \
hdfs://isilonhostname:8020/my_input_dir/
)
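As a sanity check on the memory settings in the command above, the following sketch estimates how many 8192MB containers YARN can run at once with the resources described earlier. The one-vCore-per-container figure is an assumption (it is not stated in this post), and the ApplicationMaster container is ignored.

```shell
#!/bin/sh
yarn_mem_gb=785.05       # memory available to YARN (from the cluster description)
yarn_vcores=100          # vCores available to YARN (from the cluster description)
container_mb=8192        # mapreduce.map.memory.mb / mapreduce.reduce.memory.mb
vcores_per_container=1   # ASSUMPTION: not stated in the post

# Containers that fit by memory and by vCores; the smaller value is the limit.
by_mem=$(awk -v gb="$yarn_mem_gb" -v mb="$container_mb" 'BEGIN { print int(gb * 1024 / mb) }')
by_cpu=$((yarn_vcores / vcores_per_container))

if [ "$by_mem" -lt "$by_cpu" ]; then
  max_containers=$by_mem
else
  max_containers=$by_cpu
fi

echo "by memory: $by_mem, by vCores: $by_cpu, concurrent containers: $max_containers"
```

Under these assumptions memory is the limiting factor, at 98 concurrent containers. Note also that with -Xmx4096m the JVM heap uses only half of each 8192MB container.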

The MorphlineMapper phase takes all available resources, whereas the TreeMergeMapper phase takes only a couple of containers.
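One knob that may be relevant to the merge phase is the tool's --reducers option. According to the MapReduceIndexerTool help text, the default of -1 uses all available reduce slots and then merges their output down to the number of shards, while -2 uses one reducer per output shard and disables the tree merge. The sketch below only prints the extra option for a hypothetical variation of the command above; whether it helps on this cluster is untested, and the helper name `merge_opts` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the extra MapReduceIndexerTool option that
# controls the number of reducers.
#   -1  use all reduce slots, then tree-merge down to the shard count (default)
#   -2  one reducer per output shard, which skips the tree merge phase
merge_opts() {
  printf '%s %s' '--reducers' "$1"
}

# Untested variation: trade reducer parallelism for no TreeMergeMapper phase.
extra_opts=$(merge_opts -2)
echo "extra options to try: $extra_opts"
```

The printed option would be added to the existing command line, before the input directory argument.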

We don't need to run queries for the moment; we just need to index historical data. We are wondering whether there is a way to speed up indexing and then optimize the collections for searching once indexing is complete.
