Import huge data from MySQL to Elasticsearch
I am trying to import 32M rows from MySQL into Elasticsearch using Logstash. It works fine but breaks once it reaches 3.5M rows. I have checked MySQL and Logstash and they are working fine; the problem is in Elasticsearch. Please see the logs:
[2018-08-14T23:06:44,299][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [4OtmyM2] Failed to clear cache for realms [[]]
[2018-08-14T23:06:44,345][INFO ][o.e.l.LicenseService ] [4OtmyM2] license [23fbbbff-0ba9-44f5-be52-7f5a6498dbd1] mode [basic] - valid
[2018-08-14T23:06:44,368][INFO ][o.e.g.GatewayService ] [4OtmyM2] recovered [1] indices into cluster_state
[2018-08-14T23:06:46,120][INFO ][o.e.c.r.a.AllocationService] [4OtmyM2] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[clustername][2]] ...]).
[2018-08-14T23:55:55,780][INFO ][o.e.m.j.JvmGcMonitorService] [4OtmyM2] [gc][2953] overhead, spent [378ms] collecting in the last [1s]
I've increased the heap size to 2GB, but it still can't handle it. The configuration file for the migration is below:
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/clustername?useCursorFetch=true"
    jdbc_user => "USER"
    jdbc_password => "PSWD"
    jdbc_validate_connection => true
    jdbc_driver_library => "/usr/share/java/mysql-connector-java-5.1.42.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_paging_enabled => true
    #jdbc_fetch_size => "50000"
    jdbc_page_size => 100000
    # OFFSET resumes from roughly where the previous run broke (~3.5M rows).
    # Note: with jdbc_paging_enabled, Logstash wraps the statement in its own
    # LIMIT/OFFSET, so a manual LIMIT/OFFSET can interfere with paging.
    statement => "SELECT * FROM `video` ORDER BY `id` ASC LIMIT 100000 OFFSET 3552984"
  }
}
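The output section of the pipeline is omitted above; for completeness, a minimal elasticsearch output for this kind of pipeline could look like the sketch below (the host, index name, and id field are placeholders, not taken from the actual setup):

output {
  elasticsearch {
    # Placeholder host and index name; adjust to the real cluster.
    hosts => ["localhost:9200"]
    index => "video"
    # Reusing the MySQL primary key as the document id makes a re-run
    # after a failure overwrite documents instead of duplicating them.
    document_id => "%{id}"
  }
}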
Thank you for any advice.
You haven't provided enough data to help diagnose the problem. To properly index large amounts of data, you have to truly understand what the data is, how much storage it is going to take, and how much memory it is going to use.
Elasticsearch is not magic. There are some things you have to understand if you are going beyond a simple proof of concept. When you see things like GC overhead taking a significant amount of time, you have to assume that you haven't properly sized your Elasticsearch cluster.
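Heap size is the first sizing lever to look at. On Elasticsearch 6.x the heap is configured in config/jvm.options, and the usual guidance is to set the minimum and maximum to the same value, at most about half of the machine's RAM and below the ~32GB compressed-pointers cutoff; the 2GB mentioned above may simply be too small for this volume. A minimal sketch (the 8g value is an example, not a recommendation for this specific cluster):

# config/jvm.options (Elasticsearch)
# Xms and Xmx should match; keep them <= ~50% of RAM and < ~32GB.
-Xms8g
-Xmx8g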
Things that you need to consider:
- How much memory does Elasticsearch need?
- How many nodes do you need?
There are many more things to think about beyond those, but as a general rule, try to isolate where the fault is: take MySQL and Logstash out of the mix by generating a random dataset that looks like your real data, so that you can gather the metrics needed to properly size your cluster (see the sketch below).
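As a sketch of that isolation step, assuming Python with the official elasticsearch client (the index name, field names, and document count here are all invented): bulk-index synthetic documents directly, with no MySQL or Logstash involved, and watch heap and GC behaviour while it runs.

# pip install elasticsearch
import random
import string

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["localhost:9200"])

def fake_docs(n):
    # Generate documents shaped roughly like the real `video` rows.
    for i in range(n):
        yield {
            "_index": "video_test",
            "_type": "_doc",   # required on 6.x; drop on 7.x and later
            "_id": i,
            "_source": {
                "id": i,
                "title": "".join(random.choices(string.ascii_lowercase, k=32)),
                "duration": random.randint(1, 7200),
            },
        }

# Index one million synthetic documents in chunks.
helpers.bulk(es, fake_docs(1_000_000), chunk_size=5000)

If indexing this synthetic load reproduces the GC pressure, the problem is on the Elasticsearch side and you can size the cluster against it; if it doesn't, look back at the JDBC/Logstash stage.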