How to reindex Elasticsearch in parallel

I'm trying to reindex Elasticsearch. I used the scan and bulk helpers, but it is very slow. How can I parallelize the process to make it faster? My Python code is as follows:

from elasticsearch import helpers

# es, INDEX, TYPE, INDEX_2, TYPE_2 and indexAction() are defined elsewhere
actions = []
count = 0
# scan() streams every document from the source index via the scroll API
for hit in helpers.scan(es, scroll='20m', index=INDEX, doc_type=TYPE, size=100):
    value = hit.get('_source')
    idval = hit.get('_id')
    # indexAction() builds the bulk action dict targeting the new index
    action = indexAction(INDEX_2, TYPE_2, idval, value)
    actions.append(action)
    count += 1
    if count % 200 == 0:
        helpers.bulk(es, actions, stats_only=True, chunk_size=200,
                     params={"consistency": "one"})
        actions = []
# flush the final partial batch left over after the loop
if actions:
    helpers.bulk(es, actions, stats_only=True, chunk_size=200,
                 params={"consistency": "one"})

Should I run the scan in multiple processes, or should I run the bulk indexing in multiple processes? I've been wondering how elasticsearch-hadoop implements this. My index has 10 nodes and 20 shards.

On the Elasticsearch side things are already parallel: you are getting hits from every shard. But you can always add some clauses to your query and simply run multiple searches concurrently. For example, a date range or a numeric/alphabetical range might work well for this.
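A minimal sketch of that idea, assuming the elasticsearch-py client and a hypothetical, reasonably evenly distributed numeric field user_id (the names index_1, index_2 and type_2 below just stand in for the question's INDEX, INDEX_2 and TYPE_2), could partition the source index into ranges and give each worker process its own scan-and-bulk loop:

import multiprocessing
from elasticsearch import Elasticsearch, helpers

SOURCE_INDEX = "index_1"   # hypothetical stand-ins for INDEX / INDEX_2 / TYPE_2
TARGET_INDEX = "index_2"
TARGET_TYPE = "type_2"     # drop _type entirely on Elasticsearch 7+
RANGES = [(0, 250000), (250000, 500000), (500000, 750000), (750000, 1000000)]

def reindex_slice(bounds):
    lo, hi = bounds
    es = Elasticsearch()   # each worker process gets its own client/connections
    query = {"query": {"range": {"user_id": {"gte": lo, "lt": hi}}}}
    actions = (
        {"_index": TARGET_INDEX, "_type": TARGET_TYPE,
         "_id": hit["_id"], "_source": hit["_source"]}
        for hit in helpers.scan(es, query=query, index=SOURCE_INDEX,
                                scroll="20m", size=500)
    )
    # bulk() consumes the generator in chunks, so memory use stays bounded
    helpers.bulk(es, actions, chunk_size=500, stats_only=True)

if __name__ == "__main__":
    # one process per range slice: the scans and bulk requests run concurrently
    with multiprocessing.Pool(processes=len(RANGES)) as pool:
        pool.map(reindex_slice, RANGES)

Each slice touches a disjoint subset of documents, so the workers never duplicate each other's work; the split only helps if the chosen field spreads documents fairly evenly across the ranges.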

By the way, since you are using Python, your mileage may vary when doing things concurrently with threads. I've had good experience forking processes instead of using threads in Python; there used to be issues with, e.g., the global interpreter lock (GIL).
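That said, the work here is mostly waiting on HTTP round-trips, so threads can still pay off for the indexing side: elasticsearch-py ships a parallel_bulk helper that fans chunks of actions out to a thread pool. A minimal sketch, again using hypothetical index/type names, and noting that parallel_bulk returns a lazy generator that has to be consumed:

from collections import deque
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# hypothetical names, mirroring the question's INDEX / INDEX_2 / TYPE_2
actions = (
    {"_index": "index_2", "_type": "type_2",
     "_id": hit["_id"], "_source": hit["_source"]}
    for hit in helpers.scan(es, index="index_1", scroll="20m", size=500)
)

# parallel_bulk sends chunks of 500 actions from 4 worker threads; it yields
# (ok, item) results lazily, so drain the generator to make it actually run.
deque(helpers.parallel_bulk(es, actions, thread_count=4, chunk_size=500),
      maxlen=0)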
