ETL from multiple MySQL databases

I have a system that extracts data from multiple MySQL databases with varying schemas, performs many queries (with joins), and loads the output into another MySQL database.

These queries started as a quick fix, but they have grown to over 10,000 lines and now choke the source databases.

I'm designing an efficient ETL pipeline by analyzing the SQL queries, but in the meantime, is there any temporary fix, such as a tool that could analyze the queries and reduce the number of steps needed to reach the required schema?

Any help would be life-saving :)

Rather than running queries against many MySQL databases (which are optimized for writes), you should move all of your queries to a Redshift database (which is optimized for reads).

But to do this, you need the data. Look into an ETL service that will clone ALL of the data over to your Redshift. We use Stitch Data, but there are many players in the space. You can set up multiple integrations so that each MySQL db pumps data into the same Redshift db (I'd recommend setting each one up under a uniquely-named schema), as sketched below.
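
A minimal sketch of what that layout might look like once everything has loaded. The schema names (shop_us, shop_eu) and the orders table are hypothetical; the actual names will depend on how your ETL tool sets up each integration:

    -- Each MySQL source lands in its own Redshift schema, so identically
    -- named tables from different sources never collide.
    -- shop_us, shop_eu, and orders are illustrative names only.
    CREATE VIEW all_orders AS
    SELECT 'us' AS source, order_id, customer_id, total FROM shop_us.orders
    UNION ALL
    SELECT 'eu' AS source, order_id, customer_id, total FROM shop_eu.orders;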

Once the data are all loaded, you can run your various queries in AWS Data Pipeline to create derived tables. Each query can be its own job, so you can monitor and modify on a per-query basis.
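
For instance, a single derived-table job might rebuild its output inside a transaction so readers never see a half-built table. This is only a sketch: derived.customer_totals is a hypothetical target, and all_orders is the example view from above:

    -- Rebuild the derived table atomically: the DROP and CREATE commit
    -- together, so readers see either the old table or the new one.
    BEGIN;
    DROP TABLE IF EXISTS derived.customer_totals;
    CREATE TABLE derived.customer_totals AS
    SELECT customer_id,
           COUNT(*)   AS order_count,
           SUM(total) AS lifetime_total
    FROM   all_orders
    GROUP  BY customer_id;
    COMMIT;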
