
Will binlog replication from a MySQL database maintain unique constraints if synced to Redshift/BigQuery?

We want to move our data warehouse from a MySQL database to either Redshift or BigQuery.

While these column-based databases are optimised for OLAP operations, one of their disadvantages is that they do not enforce unique constraints.

As such, it is not impossible to have duplicate orders/products in your tables. The industry we work in is retail, and we use the standard Kimball facts and dimensions (star schema) database design.

One potential solution that was brought forward was to build the database in MySQL and to use a third-party replication tool to sync the data to Redshift/BigQuery. This way, we would enforce key constraints in the original MySQL db and use Redshift/BigQuery only for read queries.

However, will enforcing the constraints in MySQL and setting up binlog replication to Redshift/BigQuery keep the data identical to the data in MySQL, and consequently preserve the unique constraints?

First of all, you cannot replicate directly from MySQL to Redshift/BigQuery.

Please understand that BigQuery is an analytical database.

What is advised is to set up replication from your MySQL instance into Cloud SQL. Then in BigQuery you can run EXTERNAL_QUERY, which means you can query/join your BigQuery datasets with the Cloud SQL MySQL database.

  1. Set up a replica from your current instance to a Cloud SQL instance; follow this guide.
  2. Understand how Cloud SQL federated queries let you query Cloud SQL instances from BigQuery.

This way you get live access to your relational database.

An example query that you would run on BigQuery:

SELECT * FROM EXTERNAL_QUERY(
  'connection_id',
  '''SELECT * FROM mysqltable AS c ORDER BY c.customer_id''');
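Here 'connection_id' stands for the connection resource you create in BigQuery for your Cloud SQL instance (typically of the form project_id.location.connection_id).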

You can even join a BigQuery table with a Cloud SQL table:

Example:

SELECT c.customer_id, c.name, SUM(t.amount) AS total_revenue,
       rq.first_order_date
FROM customers AS c
INNER JOIN transaction_fact AS t ON c.customer_id = t.customer_id
LEFT OUTER JOIN EXTERNAL_QUERY(
  'connection_id',
  '''SELECT customer_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY customer_id''') AS rq ON rq.customer_id = c.customer_id
GROUP BY c.customer_id, c.name, rq.first_order_date;

The solution you put forward will allow you:

  • to enforce unique key constraints on the source MySQL database
  • to replicate/capture all changes that happen on that database to your data warehouse

That being said, what you end up with in your data warehouse is a view of all the events (insert, update, (delete: not supported by all SaaS offerings...)) that have changed your MySQL DB. Hence the "raw" tables in your warehouse will have multiple events per unique key of your MySQL tables, and you would then need to reprocess these events to end up with the same tables as you have in MySQL.
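For illustration, assume a hypothetical raw events table in the warehouse, orders_events, keyed on order_id and carrying the updated_at/deleted_at metadata columns described further below. A single order would then show up as several rows, one per binlog event:

order_id | status  | updated_at          | deleted_at
42       | created | 2021-06-01 10:00:00 | null          (insert event)
42       | shipped | 2021-06-02 09:30:00 | null          (update event)
42       | shipped | 2021-06-03 11:15:00 | 2021-06-03    (delete event)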

To illustrate this further: it's as if your MySQL tables at each point in time were a snapshot or frozen picture/state, whereas what you get from binlog replication is the "movie" of all successive state changes of your database. If you want a snapshot in your warehouse, you then need to "replay" all the changes up to the point in time for which you want the snapshot.

This is pretty powerful in that you never lose any change happening on your database and can always find it back. But it does incur additional work to get your data warehouse tables into the same "snapshot" shape as your input database.

This can generally be done in your warehouse via a CTE that adds row_number() over (partition by id order by updated_at desc) as rn, and then filtering that CTE on where rn = 1 and deleted_at is null. Here id is the column carrying your unique constraint (you can list multiple columns in the partition by clause if your unique constraint is composite, i.e. spans multiple keys), updated_at is the timestamp of each change data capture event, and deleted_at is the timestamp of the delete event (or null if no delete event has happened for a given key).
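A minimal sketch of that query, reusing the hypothetical orders_events table from the illustration above:

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM orders_events  -- hypothetical raw CDC table
)
SELECT *
FROM ranked
WHERE rn = 1              -- keep only the latest event per unique key
  AND deleted_at IS NULL; -- drop keys whose latest event is a delete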

For open-source and self-hosted change data capture, you can also look into tools like Debezium, which runs on Kafka Connect (or AWS Kinesis or others...), if that's infrastructure your client would be willing to invest in... Or just look at logical replication connections in your language of choice's database engine/lib for your preferred DB (e.g. I use psycopg2 (with extras) for PostgreSQL on Python...).
