

Elasticsearch Bulk Write is slow using Scan and Scroll

I am currently stuck on an issue. I am trying to export Elasticsearch documents and write them to CSV. The number of documents ranges from 50,000 to 5 million. I am experiencing serious performance issues, and I get the feeling that I am missing something here.

Right now I have a dataset of 400,000 documents that I am trying to scan and scroll, and which will ultimately be formatted and written to CSV. But just outputting the documents takes 20 minutes! That is insane.

Here is my script:

import time

import elasticsearch
import elasticsearch.exceptions
import elasticsearch.helpers as helpers

# Connect to the cluster and retry requests that time out.
es = elasticsearch.Elasticsearch(['http://XX.XXX.XX.XXX:9200'], retry_on_timeout=True)

# helpers.scan returns a generator that scrolls through every document in the index.
scanResp = helpers.scan(client=es, scroll="5m", index='MyDoc', doc_type='MyDoc', timeout="50m", size=1000)

start_time = time.time()
for resp in scanResp:
    # Print the fourth value of each hit dict.
    print(list(resp.values())[3])

print("--- %s seconds ---" % (time.time() - start_time))

I am using a hosted AWS m3.medium server for Elasticsearch.

Can anyone please tell me what I might be doing wrong here?

A simple solution for exporting ES data to CSV is to use Logstash with an elasticsearch input and a csv output, using the following es2csv.conf config:

input {
  elasticsearch {
    host => "localhost"
    port => 9200
    index => "MyDoc"
  }
}
filter {
  mutate {
    remove_field => [ "@version", "@timestamp" ]
  }
}
output {
  csv {
    # Specify the field names you want
    fields => ["field1", "field2", "field3"]
    path => "/path/to/your/file.csv"
  }
}

You can then export your data easily with bin/logstash -f es2csv.conf
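If you would rather stay in Python than bring in Logstash, the same export can be sketched by streaming the hits from helpers.scan straight into a CSV writer instead of printing each one. This is only a minimal sketch, not a measured fix for the 20-minute run: the function name write_hits_to_csv, the field names, and the sample hits are hypothetical, and it assumes each hit carries its document under the "_source" key.

```python
import csv
import io


def write_hits_to_csv(hits, fields, out):
    """Stream hits into a CSV file object, one row per hit.

    `hits` is any iterable of hit dicts, e.g. the generator returned by
    elasticsearch.helpers.scan. Returns the number of data rows written.
    """
    writer = csv.writer(out)
    writer.writerow(fields)  # header row
    count = 0
    for hit in hits:
        # Assumption: the document body lives under "_source".
        source = hit.get("_source", {})
        # Fields missing from a document become empty cells.
        writer.writerow([source.get(name, "") for name in fields])
        count += 1
    return count


# Hypothetical stand-in for helpers.scan(...) output, so the sketch
# runs without a cluster.
sample_hits = [
    {"_id": "1", "_source": {"field1": "a", "field2": 1}},
    {"_id": "2", "_source": {"field1": "b", "field3": "x"}},
]

buffer = io.StringIO()
rows_written = write_hits_to_csv(sample_hits, ["field1", "field2", "field3"], buffer)
```

Against a real cluster you would pass the scanResp generator from the question as hits, and a file opened with open("/path/to/your/file.csv", "w", newline="") as out. Because rows are written as they arrive, the full result set never has to be held in memory.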
