I am stuck on a performance problem. I need to export Elasticsearch documents and write them to CSV. The number of documents ranges from 50,000 to 5 million. I am experiencing serious performance issues and I get the feeling that I am missing something here.
Right now I have a dataset of 400,000 documents that I am trying to scan and scroll, and which will ultimately be formatted and written to CSV. But just outputting the documents takes 20 minutes! That is insane.
Here is my script:
import elasticsearch
import elasticsearch.exceptions
import elasticsearch.helpers as helpers
import time
es = elasticsearch.Elasticsearch(['http://XX.XXX.XX.XXX:9200'],retry_on_timeout=True)
scanResp = helpers.scan(client=es,scroll="5m",index='MyDoc',doc_type='MyDoc',timeout="50m",size=1000)
resp={}
start_time = time.time()
for resp in scanResp:
    data = resp
    print data.values()[3]
print("--- %s seconds ---" % (time.time() - start_time))
I am using a hosted AWS m3.medium server for Elasticsearch.
Can anyone please tell me what I might be doing wrong here?
A simple solution to output ES data to CSV is to use Logstash with an elasticsearch input and a csv output, with the following es2csv.conf config:
input {
  elasticsearch {
    # Logstash 2.0 and later replace host/port with: hosts => ["localhost:9200"]
    host => "localhost"
    port => 9200
    index => "MyDoc"
  }
}
filter {
  mutate {
    remove_field => [ "@version", "@timestamp" ]
  }
}
output {
  csv {
    # specify the field names you want
    fields => ["field1", "field2", "field3"]
    path => "/path/to/your/file.csv"
  }
}
You can then export your data easily with bin/logstash -f es2csv.conf
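If you would rather stay in Python, here is a minimal sketch of the CSV-writing step, assuming Python 3 and the standard csv module. The function name write_hits_to_csv and the sample hits are made up for illustration; the only assumption about Elasticsearch is that each hit is a dict with a "_source" key, which is what helpers.scan yields. Writing rows straight to a file instead of printing every document to the console may also help, since console output is often slow, but I have not measured it against your data.

```python
import csv
import io


def write_hits_to_csv(hits, fields, out):
    """Write the chosen _source fields of each hit as one CSV row.

    `hits` is any iterable of dicts shaped like Elasticsearch hits,
    `fields` is the list of column names, `out` is a writable text stream.
    Returns the number of data rows written.
    """
    writer = csv.writer(out)
    writer.writerow(fields)  # header row
    count = 0
    for hit in hits:
        source = hit.get("_source", {})
        # Missing fields become empty cells instead of raising KeyError.
        writer.writerow([source.get(name, "") for name in fields])
        count += 1
    return count


# Stand-in for the iterator returned by helpers.scan (hypothetical data).
sample_hits = [
    {"_id": "1", "_source": {"field1": "a", "field2": 1}},
    {"_id": "2", "_source": {"field1": "b"}},
]
buf = io.StringIO()
rows = write_hits_to_csv(sample_hits, ["field1", "field2"], buf)
```

With the client from the question, the same function could be fed directly from the scan, for example write_hits_to_csv(helpers.scan(client=es, index='MyDoc', scroll='5m', size=1000), fields, f), where f is a file opened with open(path, 'w', newline='').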