
Import huge data from MySQL to Elasticsearch

I am trying to import 32M rows from MySQL to Elasticsearch using Logstash. It works fine but breaks when it reaches 3.5M rows. I have checked MySQL and Logstash and they work fine; the problem is in Elasticsearch, please see the logs:

[2018-08-14T23:06:44,299][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [4OtmyM2] Failed to clear cache for realms [[]]
[2018-08-14T23:06:44,345][INFO ][o.e.l.LicenseService     ] [4OtmyM2] license [23fbbbff-0ba9-44f5-be52-7f5a6498dbd1] mode [basic] - valid
[2018-08-14T23:06:44,368][INFO ][o.e.g.GatewayService     ] [4OtmyM2] recovered [1] indices into cluster_state
[2018-08-14T23:06:46,120][INFO ][o.e.c.r.a.AllocationService] [4OtmyM2] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[clustername][2]] ...]).
[2018-08-14T23:55:55,780][INFO ][o.e.m.j.JvmGcMonitorService] [4OtmyM2] [gc][2953] overhead, spent [378ms] collecting in the last [1s]

I've increased the heap size to 2GB, but it still can't handle it. The configuration file for the migration is below:

input {
    jdbc {
        jdbc_connection_string => "jdbc:mysql://localhost:3306/clustername?useCursorFetch=true"
        jdbc_user => "USER"
        jdbc_password => "PSWD"
        jdbc_validate_connection => true
        jdbc_driver_library => "/usr/share/java/mysql-connector-java-5.1.42.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        jdbc_paging_enabled => "true"
        #jdbc_fetch_size => "50000"
        jdbc_page_size => 100000
        statement => "SELECT * FROM `video` ORDER by `id` ASC LIMIT 100000 OFFSET 3552984"
    }
}
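
For reference, the heap size mentioned above is normally set in Elasticsearch's config/jvm.options; a minimal sketch of the 2 GB setting, with the file location assumed to match a default install and both values kept equal:

# config/jvm.options -- assumed location; keep -Xms and -Xmx the same
-Xms2g
-Xmx2g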

Thank you for any advice.

You haven't provided enough data to help diagnose the problem. To properly index large amounts of data, you have to truly understand what the data is, how much storage it's going to take, and how much memory it's going to use.

Elasticsearch is not magic. You have to understand some things if you are going beyond a simple proof of concept. When you see things like GC overhead taking a significant amount of time, you have to assume that you haven't properly sized your Elasticsearch cluster.

Things that you need to consider:

  • How many shards do I need?
    • The default of 5 in the elasticsearch config file may work, or it may be too many or too few.
    • Too many shards can cause Elasticsearch to run out of memory. Too few shards can cause bad performance.
    • To aid in cluster recovery, your shards should not be large -- anything in the 2 GB to 4 GB range should already be considered "large".
    • Elasticsearch provides APIs to see how many shards you're using and how big they are (see the quick check after this list).
  • How much memory does elasticsearch need?
    • For a data node, the recommended usage is 50% of the system's RAM.
    • The 50% recommendation is related to allowing the OS to use the other 50% for disk cache.
    • If you are running other things on the nodes, you probably need to re-architect, or adjust if performance allows.
    • If your data is time-series based, you should probably be using time-series named indexes (with the frequency being yearly/monthly/weekly/daily depending on how many records per day are generated).
  • How many nodes do you need?
    • Without a 2nd node, you can't have replicas.
    • Without replicas, you will eventually lose data.
    • You need to have an odd number of master-eligible nodes (otherwise you can get into a split-brain situation where your cluster is partitioned).
    • More nodes are better -- especially if you need a lot of shards.
  • How big is your data?
    • You can reduce the size by configuring fields as keyword-only fields (i.e. if you don't need to search certain fields, or only need to search based on _all); the sketch at the end of this answer maps such fields as keyword.
    • How many fields are you using per record -- more fields = more RAM per row.
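
As a quick check of the shard-related points above, the _cat APIs show how many shards you have and how big they are. A minimal sketch, assuming Elasticsearch is reachable on localhost:9200 without authentication (the endpoints and column names are standard; everything else is illustrative):

import requests

ES = "http://localhost:9200"

# One row per shard: owning index, primary or replica, state, doc count, size on disk.
print(requests.get(ES + "/_cat/shards?v&h=index,shard,prirep,state,docs,store").text)

# Per-index summary: primary/replica counts, document count and total store size.
print(requests.get(ES + "/_cat/indices?v&h=index,pri,rep,docs.count,store.size").text)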

There are many more things that you need to consider, but as a general rule, try to isolate where your fault is -- i.e. remove the SQL server / Logstash from the mix by generating some random data that looks like your real data, so that you can gather the metrics needed to properly size your cluster.
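
A minimal sketch of that last suggestion, in Python with the requests library: it creates a throwaway test index (with never-searched fields mapped as keyword, per the list above) and bulk-loads random documents so you can watch heap, GC and store size as the count grows. The index name, mapping and field shapes are made up and should be replaced with ones that mirror the real `video` rows; Elasticsearch 6.x on localhost:9200 without authentication is assumed:

import json
import random
import string

import requests

ES = "http://localhost:9200"
INDEX = "video_sizing_test"   # hypothetical test index, safe to delete afterwards

# Create the test index. Fields that are never full-text searched are mapped as
# keyword, which is cheaper than analyzed text (see the "How big is your data" point).
mapping = {
    "mappings": {
        "_doc": {                                   # single mapping type in ES 6.x
            "properties": {
                "title":    {"type": "text"},
                "url":      {"type": "keyword"},
                "category": {"type": "keyword"},
                "views":    {"type": "long"},
            }
        }
    }
}
requests.put(ES + "/" + INDEX, json=mapping).raise_for_status()

def fake_doc():
    # Shapes and value ranges are guesses; make them match the real `video` rows.
    return {
        "title": " ".join(random.choices(["cat", "video", "music", "news", "sport"], k=5)),
        "url": "https://example.com/" + "".join(random.choices(string.ascii_lowercase, k=12)),
        "category": random.choice(["music", "sport", "news"]),
        "views": random.randint(0, 1000000),
    }

# Bulk-index in batches; watch heap usage, GC logs and _cat/indices while this runs.
batch_size, total_docs = 5000, 1000000
for start in range(0, total_docs, batch_size):
    lines = []
    for doc_id in range(start, start + batch_size):
        lines.append(json.dumps({"index": {"_index": INDEX, "_type": "_doc", "_id": doc_id}}))
        lines.append(json.dumps(fake_doc()))
    resp = requests.post(ES + "/_bulk", data="\n".join(lines) + "\n",
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()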
