
Ingest RDBMS data to BigQuery

Original post: 2022-05-12 06:13:47. Tags: google-cloud-platform, google-bigquery, google-cloud-dataflow, data-ingestion

We have on-prem sources such as SQL Server and Oracle. Data from them has to be ingested into BigQuery periodically, in batch mode. What should the architecture be? Which GCP-native services can be used for this? Can Dataflow or Dataproc be used?

PS: Our organization hasn't licensed any third-party ETL tool so far. Our preference is for a Google-native service. Data Fusion is very expensive.

1 answer

There are two approaches you can take with Apache Beam.

1. Periodically run a Beam/Dataflow batch job against your database. You could use Beam's JdbcIO connector to read the data. After that, you can transform the data using Beam transforms (PTransforms) and write it to the destination using a Beam sink. In this approach, you are responsible for handling duplicate data (for example, by providing different SQL queries across executions).
2. Use a Beam/Dataflow pipeline that can read change streams from a database. The simplest approach here might be to use one of the available Dataflow templates. For example, see here. You can also develop your own pipeline using Beam's DebeziumIO connector.
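To make the "different SQL queries across executions" idea from the first approach more concrete, here is a minimal sketch of a watermark-based incremental query. It is only an illustration under assumptions: the `orders` table, the `updated_at` column, and the helper names are hypothetical, and an in-memory SQLite database stands in for SQL Server or Oracle so the example can run on its own. In a real batch job, a query of this shape is what you would pass to the JDBC read step, with the last watermark persisted somewhere between runs.

```python
import sqlite3


def incremental_query(table: str, watermark_column: str) -> str:
    """Build a SELECT over the half-open window (low, high].

    Identifiers cannot be bound as parameters, so they are interpolated
    here; they must come from trusted configuration, not user input.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > ? AND {watermark_column} <= ? "
        f"ORDER BY {watermark_column}"
    )


def extract_batch(conn, table, watermark_column, last_watermark, upper_bound):
    """Read only rows changed since the previous run.

    Returns the rows and the new watermark to persist for the next run.
    """
    sql = incremental_query(table, watermark_column)
    rows = conn.execute(sql, (last_watermark, upper_bound)).fetchall()
    return rows, upper_bound


def demo():
    # Hypothetical source table; SQLite stands in for the on-prem database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [
            (1, "2022-05-10T00:00:00"),
            (2, "2022-05-11T00:00:00"),
            (3, "2022-05-12T00:00:00"),
        ],
    )
    # First run: everything up to and including 2022-05-11.
    first, watermark = extract_batch(
        conn, "orders", "updated_at", "1970-01-01T00:00:00", "2022-05-11T00:00:00"
    )
    # Second run: starts strictly after the previous watermark.
    second, watermark = extract_batch(
        conn, "orders", "updated_at", watermark, "2022-05-12T00:00:00"
    )
    return first, second


if __name__ == "__main__":
    first_run, second_run = demo()
    print("first run ids:", [row[0] for row in first_run])
    print("second run ids:", [row[0] for row in second_run])
```

Because each window is open at the lower bound and closed at the upper bound, consecutive runs do not overlap, so no row is read twice. This assumes the watermark column is reliably updated on every change and that rows are not deleted; if that does not hold, the change-stream approach is likely the better fit.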