How to reindex Elasticsearch in parallel

I'm trying to reindex Elasticsearch. I used the scan and bulk helpers, but it is very slow. How can I parallelize the process to make it faster? My Python code is as follows:

from elasticsearch import helpers

# es, INDEX, TYPE, INDEX_2, TYPE_2 and indexAction() are defined elsewhere
actions = []
count = 0
# scan() streams every document from the source index via the scroll API
for hit in helpers.scan(es, scroll='20m', index=INDEX, doc_type=TYPE, size=100):
    value = hit.get('_source')
    idval = hit.get('_id')
    # indexAction() builds the bulk action dict targeting the new index
    action = indexAction(INDEX_2, TYPE_2, idval, value)
    actions.append(action)
    count += 1
    if count % 200 == 0:
        helpers.bulk(es, actions, stats_only=True, chunk_size=200,
                     params={"consistency": "one"})
        actions = []
# flush the final partial batch left over after the loop
if actions:
    helpers.bulk(es, actions, stats_only=True, chunk_size=200,
                 params={"consistency": "one"})

Should I run the scan in multiple processes, or should I run the bulk indexing in multiple processes? I've been wondering how elasticsearch-hadoop implements this. My index has 10 nodes and 20 shards.

On the Elasticsearch side things are already parallel: you are getting hits from every shard. But you can always add some clauses to your query and simply run multiple searches concurrently. For example, a date range or a numeric/alphabetical range might work well for this.
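A minimal sketch of that idea, assuming the elasticsearch-py client and a hypothetical, reasonably evenly distributed numeric field user_id (the names index_1, index_2 and type_2 below just stand in for the question's INDEX, INDEX_2 and TYPE_2), could partition the source index into ranges and give each worker process its own scan-and-bulk loop:

import multiprocessing
from elasticsearch import Elasticsearch, helpers

SOURCE_INDEX = "index_1"   # hypothetical stand-ins for INDEX / INDEX_2 / TYPE_2
TARGET_INDEX = "index_2"
TARGET_TYPE = "type_2"     # drop _type entirely on Elasticsearch 7+
RANGES = [(0, 250000), (250000, 500000), (500000, 750000), (750000, 1000000)]

def reindex_slice(bounds):
    lo, hi = bounds
    es = Elasticsearch()   # each worker process gets its own client/connections
    query = {"query": {"range": {"user_id": {"gte": lo, "lt": hi}}}}
    actions = (
        {"_index": TARGET_INDEX, "_type": TARGET_TYPE,
         "_id": hit["_id"], "_source": hit["_source"]}
        for hit in helpers.scan(es, query=query, index=SOURCE_INDEX,
                                scroll="20m", size=500)
    )
    # bulk() consumes the generator in chunks, so memory use stays bounded
    helpers.bulk(es, actions, chunk_size=500, stats_only=True)

if __name__ == "__main__":
    # one process per range slice: the scans and bulk requests run concurrently
    with multiprocessing.Pool(processes=len(RANGES)) as pool:
        pool.map(reindex_slice, RANGES)

Each slice touches a disjoint subset of documents, so the workers never duplicate each other's work; the split only helps if the chosen field spreads documents fairly evenly across the ranges.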

By the way, since you are using Python, your mileage may vary when doing things concurrently with threads. I've had good experience forking processes instead of using threads in Python; there used to be issues with, e.g., the global interpreter lock (GIL).
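That said, the work here is mostly waiting on HTTP round-trips, so threads can still pay off for the indexing side: elasticsearch-py ships a parallel_bulk helper that fans chunks of actions out to a thread pool. A minimal sketch, again using hypothetical index/type names, and noting that parallel_bulk returns a lazy generator that has to be consumed:

from collections import deque
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# hypothetical names, mirroring the question's INDEX / INDEX_2 / TYPE_2
actions = (
    {"_index": "index_2", "_type": "type_2",
     "_id": hit["_id"], "_source": hit["_source"]}
    for hit in helpers.scan(es, index="index_1", scroll="20m", size=500)
)

# parallel_bulk sends chunks of 500 actions from 4 worker threads; it yields
# (ok, item) results lazily, so drain the generator to make it actually run.
deque(helpers.parallel_bulk(es, actions, thread_count=4, chunk_size=500),
      maxlen=0)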
