Is it possible to read data from an MSSQL database using Apache Beam?

Question

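I have been trying to connect to an Azure database with Apache Beam, and I would like to load some data into a pandas dataframe. To do this, I have been using the apache_beam.io.jdbc module.

I can't find any real documentation on the subject apart from the following: https://beam.apache.org/releases/pydoc/2.43.0/apache_beam.io.jdbc.html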

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc

with beam.Pipeline() as p:
    result = (
        p
        | 'Read from jdbc' >> ReadFromJdbc(
            fetch_size=None,
            table_name='table_name',
            driver_class_name='com.microsoft.sqlserver.jdbc.SQLServerDriver',
            jdbc_url='jdbc:sqlserver://xxx:1433',
            username='xxx',
            password='xxx',
            query='SELECT * from table_name',
            connection_properties=';database=xxx;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;'
        )
        | beam.Map(print)
    )

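I know there are simpler ways to do this, but I need this approach so that I can use Dataflow to ingest the data into Google Cloud BigQuery.

Is Apache Beam even intended for loading data from databases?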

Answer 1

If you want to apply this logic in Beam and load from an MSSQL database into BigQuery, you can use pure Beam code instead of a dataframe:

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions


def your_transform(element):
    # Apply your transformation logic here and return the element to write.
    return element


pipeline_options = PipelineOptions()

with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | 'Read from jdbc' >> ReadFromJdbc(
            table_name='jdbc_external_test_read',
            driver_class_name='com.microsoft.sqlserver.jdbc.SQLServerDriver',
            jdbc_url='jdbc:sqlserver://xxx:1433',
            username='postgres',
            password='postgres',
            # Maven coordinates of the MSSQL JDBC driver.
            classpath=['com.microsoft.sqlserver:mssql-jdbc:11.2.2.jre8'])
        | "Your transformation before BQ if needed" >> beam.Map(your_transform)
        | "write_hist_intraday" >> beam.io.WriteToBigQuery(
            project="project_id",
            dataset="dataset",
            table="table",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )

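Read your data from the MSSQL database with ReadFromJdbc (check the details on accessing an external database).
Then, if needed, apply a transformation with Map before writing the data to BigQuery.
Write the result to BigQuery with the WriteToBigQuery IO. Each element written should be a Python dict that matches the schema of the BigQuery table.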
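
As a minimal sketch of the mapping step, assuming the rows coming out of ReadFromJdbc behave like named tuples (that is, they expose `_asdict()`), the transform could turn each row into a plain dict. The `Row` type and its columns `id` and `name` are hypothetical stand-ins, used here only so the function can be tried without a database:

```python
from collections import namedtuple

# Hypothetical stand-in for a row read from the database;
# the column names are assumptions made for illustration.
Row = namedtuple("Row", ["id", "name"])


def row_to_dict(row):
    """Convert a named-tuple-like row into a plain dict keyed by column name."""
    return dict(row._asdict())


example = row_to_dict(Row(id=1, name="alice"))
print(example)
```

In the pipeline above, this function would be passed as beam.Map(row_to_dict) in place of your_transform, provided the dict keys line up with the column names of the BigQuery table.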
