
How to use GCP Cloud SQL as Dataflow source and/or sink with Python?

Is there any guidance available for using Google Cloud SQL as a Dataflow read source and/or sink?

The Apache Beam Python SDK 2.1.0 documentation has no chapter mentioning Google Cloud SQL, although BigQuery is covered.

And in the tutorial Performing ETL from a Relational Database into BigQuery, I saw that they exported the data to a file and used that file as the source for the process. That means there has to be an export step in between, and that's not ideal.

Are there specific issues you need to take care of when using Cloud SQL in particular, both as a source and as a sink?

The Beam Python SDK does not have a built-in transform to read data from a MySQL/Postgres database. Nonetheless, it should not be too troublesome to write a custom transform to do this. You can do something like this:

with beam.Pipeline() as p:
  query_result_pc = (p 
                     | beam.Create(['select a,b,c from table1'])
                     | beam.ParDo(QueryMySqlFn(host='...', user='...'))
                     | beam.Reshuffle())

To connect to MySQL, we'll use the MySQL-specific library mysql.connector, but you can use the appropriate library for Postgres, etc.

Your querying function is:

import mysql.connector


class QueryMySqlFn(beam.DoFn):

  def __init__(self, **server_configuration):
    self.config = server_configuration

  def start_bundle(self):
    # Open one connection per bundle rather than per element.
    self.mydb = mysql.connector.connect(**self.config)
    self.cursor = self.mydb.cursor()

  def finish_bundle(self):
    self.cursor.close()
    self.mydb.close()

  def process(self, query):
    self.cursor.execute(query)
    for result in self.cursor:
      yield result

For Postgres, you would use psycopg2 or any other library that allows you to connect to it:

import psycopg2


class QueryPostgresFn(beam.DoFn):

  def __init__(self, **server_config):
    self.config = server_config

  def process(self, query):
    # Opens a new connection per element; for many queries, consider
    # reusing a connection in start_bundle as in the MySQL example.
    con = psycopg2.connect(**self.config)
    cur = con.cursor()
    cur.execute(query)
    rows = cur.fetchall()
    con.close()
    return rows

FAQ

  • Why do you have a beam.Reshuffle transform there? - Because the QueryMySqlFn does not parallelize reading data from the database. The reshuffle ensures that our data is parallelized downstream for further processing.
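If you want the reads themselves to happen in parallel, a common pattern is to split one logical query into several range-restricted queries before the ParDo, so each worker executes its own slice. Here is a minimal sketch of that idea; the table, key column, and id bounds are hypothetical, and it assumes a numeric key whose values lie in a known range:

```python
def shard_queries(table, key, min_id, max_id, num_shards):
    """Split a scan of `table` into range-restricted queries.

    Assumes `key` is a numeric column with values in [min_id, max_id];
    each shard covers a half-open range [lo, hi).
    """
    span = max_id - min_id + 1
    step = (span + num_shards - 1) // num_shards  # ceiling division
    queries = []
    for lo in range(min_id, max_id + 1, step):
        hi = min(lo + step, max_id + 1)
        queries.append(
            "SELECT a, b, c FROM {t} WHERE {k} >= {lo} AND {k} < {hi}"
            .format(t=table, k=key, lo=lo, hi=hi))
    return queries


# Each query string can then feed beam.Create(...), so every range
# is read by a separate QueryMySqlFn invocation.
print(shard_queries("table1", "id", 1, 100, 4))
```

With this, `beam.Create(shard_queries(...))` replaces the single-query `beam.Create(['select a,b,c from table1'])` in the pipeline above, and the Reshuffle is still useful for balancing the rows downstream.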

There is one good library for SQL ingestion, https://github.com/pysql-beam/pysql-beam; please go through its examples. It supports RDBMSs like MySQL and PostgreSQL.

It provides both read and write options; for example, we can read data from Google Cloud SQL like this:

from pysql_beam.sql_io.sql import ReadFromSQL

....
ReadFromSQL(host=options.host, port=options.port,
            username=options.username, password=options.password,
            database=options.database,
            query=options.source_query,
            wrapper=PostgresWrapper,
            batch=100000)
