How to parallel reIndex ElasticSearch

I'm trying to reindex ElasticSearch. I used the scan and bulk helpers, but it's very slow. How can I parallelize the process to make it faster? My Python code is as follows:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # client used for both the scan and the bulk calls
actions = []
count = 0
# scroll over every document in the source index
for hit in helpers.scan(es, scroll='20m', index=INDEX, doc_type=TYPE,
                        params={"size": 100}):
    value = hit.get('_source')
    idval = hit.get('_id')
    action = indexAction(INDEX_2, TYPE_2, idval, value)  # build one bulk action
    actions.append(action)
    count += 1
    if count % 200 == 0:
        # flush a batch of 200 actions into the target index
        helpers.bulk(es, actions, stats_only=True,
                     params={"consistency": "one", "chunk_size": 200})
        actions = []

# flush whatever is left over after the loop ends
if actions:
    helpers.bulk(es, actions, stats_only=True,
                 params={"consistency": "one", "chunk_size": 200})

Should I run the scan in multiple processes, or should I run the bulk indexing in multiple processes? I've also been wondering how elasticsearch-hadoop implements this. My index has 10 nodes and 20 shards.

On the Elasticsearch side things are already parallel: you are getting hits from every shard. But you can always add some clauses to your query and simply run multiple searches concurrently. For example, a date range or a numeric/alphabetical range might work well for this.
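A minimal sketch of that idea, assuming the documents carry a numeric field (here the made-up name timestamp) whose range can be cut into slices; each call scans only its own slice of the source index and bulk-indexes it into the target. INDEX, TYPE, INDEX_2 and TYPE_2 are the same names as in the question.

from elasticsearch import Elasticsearch, helpers

def reindex_range(start, end):
    # one worker: scan only the documents whose (hypothetical) 'timestamp'
    # falls in [start, end) and bulk-index them into the target index
    es = Elasticsearch()  # fresh client per call, safe to use in a forked worker
    query = {"query": {"range": {"timestamp": {"gte": start, "lt": end}}}}
    actions = (
        {"_index": INDEX_2, "_type": TYPE_2,
         "_id": hit["_id"], "_source": hit["_source"]}
        for hit in helpers.scan(es, query=query, scroll='20m',
                                index=INDEX, doc_type=TYPE, size=500)
    )
    helpers.bulk(es, actions, chunk_size=500, stats_only=True)

Using a generator for the actions keeps memory flat, and creating the client inside the function avoids sharing connections between workers.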

BTW, since you are using Python, your mileage may vary when doing things concurrently with threads. I've had good experience forking processes instead of using threads with Python. There used to be issues with, e.g., the global lock on the interpreter in Python.
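For illustration, the reindex_range sketch above can be fanned out with a process pool, which sidesteps the interpreter lock entirely; the slice boundaries below are invented and would have to match the real range of the field.

from multiprocessing import Pool

# made-up, evenly spaced boundaries of the hypothetical 'timestamp' field;
# each (start, end) pair becomes one independent scan + bulk job
slices = [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        pool.starmap(reindex_range, slices)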
