
Solr indexing performance

We are experiencing performance issues with Solr batch indexing. Our cluster consists of 4 workers, each with 32 cores and 256 GB of RAM. YARN is configured to use 100 vCores and 785.05 GB of memory. HDFS storage is provided by an EMC Isilon system connected through a 10 Gb interface. The cluster runs CDH 5.8.0 with Solr 4.10.3, and it is Kerberized.

With the current setup, we can index about 25 GB of compressed data per day and 500 GB per month using MapReduce jobs. Some of these jobs run daily and take almost 12 hours to index 15 GB of compressed data. In particular, the MorphlineMapper jobs last approximately 5 hours and the TreeMergeMapper jobs about 6 hours.
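To put those figures in perspective, here is a rough back-of-the-envelope calculation of the average throughput of one daily job. It is only a sketch: the function name `throughput_mb_per_s` is made up for illustration, the inputs are the approximate numbers quoted above (15 GB in roughly 12 hours), and 1 GB is assumed to be 1024 MB.

```shell
#!/usr/bin/env bash
# Average throughput in MB/s for a volume (in GB) indexed in a given time (in hours).
# Assumption: 1 GB = 1024 MB. The helper name is hypothetical.
throughput_mb_per_s() {
  local gigabytes=$1 hours=$2
  awk -v gb="$gigabytes" -v h="$hours" \
    'BEGIN { printf "%.2f\n", (gb * 1024) / (h * 3600) }'
}

# Approximate figures from the daily job: 15 GB of compressed data in about 12 hours.
throughput_mb_per_s 15 12
```

Under these assumptions the job averages well under 1 MB/s of compressed input, which is why we suspect something is wrong.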

Is this performance normal? Can you suggest any tweaks that could improve our indexing performance?

We are using the MapReduceIndexerTool, and there are no network problems. We read compressed files from HDFS and decompress them in our morphline. This is how we run our script:

cmd_hdp=$(
  HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf" hadoop --config /etc/hadoop/conf.cloudera.yarn \
    jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    -D morphlineVariable.ZK_HOST=hostname1:2181/solr \
    -D morphlineVariable.COLLECTION=my_collection \
    -D mapreduce.map.memory.mb=8192 \
    -D mapred.child.java.opts=-Xmx4096m \
    -D mapreduce.reduce.java.opts=-Xmx4096m \
    -D mapreduce.reduce.memory.mb=8192 \
    --output-dir hdfs://isilonhostname:8020/tmp/my_tmp_dir \
    --morphline-file morphlines/my_morphline.conf \
    --log4j log4j.properties \
    --go-live \
    --collection my_collection \
    --zk-host hostname1:2181/solr \
    hdfs://isilonhostname:8020/my_input_dir/
)

The MorphlineMapper phase takes all available resources, whereas the TreeMergeMapper phase takes only a couple of containers.
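For reference, "all available resources" can be estimated from the settings above. The sketch below computes an upper bound on the number of concurrent task containers from the YARN limits (785.05 GB, 100 vCores) and the 8192 MB container size we request. It assumes one vCore per task (the Hadoop default for `mapreduce.map.cpu.vcores`) and ignores the ApplicationMaster container and any scheduler rounding. The function name `max_containers` is made up for illustration.

```shell
#!/usr/bin/env bash
# Upper bound on concurrent containers: the smaller of the memory-based
# and the vCore-based limit. The helper name is hypothetical.
max_containers() {
  local yarn_mem_gb=$1 yarn_vcores=$2 container_mb=$3 container_vcores=$4
  awk -v mem="$yarn_mem_gb" -v vc="$yarn_vcores" \
      -v cmb="$container_mb" -v cvc="$container_vcores" 'BEGIN {
    by_mem = int(mem * 1024 / cmb)   # containers that fit in YARN memory
    by_cpu = int(vc / cvc)           # containers that fit in YARN vCores
    n = (by_mem < by_cpu) ? by_mem : by_cpu
    print n
  }'
}

# YARN: 785.05 GB and 100 vCores; each task: 8192 MB and (assumed) 1 vCore.
max_containers 785.05 100 8192 1
```

By this estimate the MorphlineMapper phase can run close to a hundred tasks in parallel, while the TreeMergeMapper phase uses only a couple of containers out of that capacity.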

We don't need to run queries for the moment; we just need to index historical data. We are wondering whether there is a way to speed up indexing and then optimize the collections for searching once indexing is complete.
