Elasticsearch Bulk Write is slow using Scan and Scroll

Question

I am currently stuck on an issue. I need to export Elasticsearch documents and write them to CSV. The number of documents ranges from 50,000 to 5 million. I am experiencing serious performance issues and I get the feeling that I am missing something here.

Right now I have a dataset of 400,000 documents that I am trying to scan and scroll, and which will ultimately be formatted and written to CSV. But just outputting the documents takes 20 minutes! That is insane.

Here is my script:

import elasticsearch
import elasticsearch.exceptions 
import elasticsearch.helpers as helpers
import time

es = elasticsearch.Elasticsearch(['http://XX.XXX.XX.XXX:9200'], retry_on_timeout=True)

scanResp = helpers.scan(client=es, scroll="5m", index='MyDoc', doc_type='MyDoc', timeout="50m", size=1000)

resp={}
start_time = time.time()
for resp in scanResp:
    data = resp
    print data.values()[3]

print("--- %s seconds ---" % (time.time() - start_time))

I am using a hosted AWS m3.medium server for Elasticsearch.

Can anyone please tell me what I might be doing wrong here?

Answer 1

A simple solution to output ES data to CSV is to use Logstash with an elasticsearch input and a csv output with the following es2csv.conf config:

input {
  elasticsearch {
    host => "localhost"
    port => 9200
    index => "MyDoc"
  }
}
filter {
  mutate {
    remove_field => [ "@version", "@timestamp" ]
  }
}
output {
  csv {
    # specify the field names you want
    fields => ["field1", "field2", "field3"]
    path => "/path/to/your/file.csv"
  }
}

You can then export your data easily with bin/logstash -f es2csv.conf
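
If you would rather stay in Python, a similar export can be sketched with the standard csv module, writing each hit as it comes off the scroll instead of printing it. This is only a minimal sketch in Python 3, not a tuned solution: write_hits_to_csv and export_index are hypothetical helper names, field1 and field2 are the placeholder field names from the config above, and passing the request body as query= to helpers.scan is an assumption about your client version. The demonstration at the bottom runs on made-up hits, so it does not need a cluster.

```python
import csv
import io


def write_hits_to_csv(hits, fieldnames, out):
    """Stream Elasticsearch hits into CSV rows, one hit at a time.

    `hits` is any iterable of dicts shaped like scan/scroll results,
    i.e. each one carries the document under the "_source" key.
    Returns the number of rows written.
    """
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for hit in hits:
        # Missing fields become empty cells; fields not listed are ignored.
        writer.writerow(hit.get("_source", {}))
        count += 1
    return count


def export_index(host, index, fieldnames, path):
    """Hypothetical wrapper: scan a whole index and write it to `path`."""
    # Imported here so the CSV helper above works without the client installed.
    import elasticsearch
    from elasticsearch import helpers

    es = elasticsearch.Elasticsearch([host])
    hits = helpers.scan(
        client=es,
        index=index,
        scroll="5m",
        size=1000,
        # Assumption: only the fields that end up in the CSV are requested.
        query={"_source": fieldnames, "query": {"match_all": {}}},
    )
    with open(path, "w", newline="") as out:
        return write_hits_to_csv(hits, fieldnames, out)


# Offline demonstration with made-up hits (no cluster needed).
sample_hits = [
    {"_id": "1", "_source": {"field1": "a", "field2": 1, "extra": "x"}},
    {"_id": "2", "_source": {"field1": "b"}},
]
buffer = io.StringIO()
rows_written = write_hits_to_csv(sample_hits, ["field1", "field2"], buffer)
csv_text = buffer.getvalue()
```

Against a real cluster you would call something like export_index('http://XX.XXX.XX.XXX:9200', 'MyDoc', ['field1', 'field2'], '/path/to/your/file.csv'). Restricting _source to the exported fields may reduce how much data each scroll page has to carry, but whether that helps depends on your documents.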

solution1
0 2015-10-23 11:36:22
