Python: How to Connect to Snowflake Using Apache Beam?

I see that there's a built-in I/O connector for BigQuery, but a lot of our data is stored in Snowflake. Is there a workaround for connecting to Snowflake? The only thing I can think of doing is to use sqlalchemy to run the query, dump the output to a Cloud Storage bucket, and then have Apache Beam get its input data from the files stored in the bucket.

Snowflake Python and Java connectors were added to Beam recently.

Right now (version 2.24) it supports only the ReadFromSnowflake operation, in apache_beam.io.external.snowflake.

In the 2.25 release, WriteToSnowflake will also be available, in the apache_beam.io.snowflake module. You can still use the old path, but it will be considered deprecated as of that version.
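
For quick reference, the import paths look roughly like this (the 2.25 path is what the release notes describe, so verify it against your installed version):

    # Beam 2.24: the read transform lives in the "external" package.
    from apache_beam.io.external.snowflake import ReadFromSnowflake

    # Beam 2.25+: the transforms are expected in apache_beam.io.snowflake;
    # the old path should still work there but is considered deprecated.
    # from apache_beam.io.snowflake import ReadFromSnowflake, WriteToSnowflake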

Right now it runs only on the Flink runner, but there is an effort to make it available for other runners as well.
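
To give an idea of what a read pipeline looks like, here is a rough sketch based on the 2.24 pydoc. The parameter names (server_name, staging_bucket_name, storage_integration_name, csv_mapper, etc.), the Flink options, and all connection values are assumptions/placeholders to check against the linked source and your own cluster:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.io.external.snowflake import ReadFromSnowflake

    # Placeholder runner options; adjust the Flink master address to your setup.
    options = PipelineOptions([
        "--runner=FlinkRunner",
        "--flink_master=localhost:8081",
    ])

    def csv_mapper(fields):
        # Rows are staged as CSV, so each element arrives as a list of column
        # strings; map it to whatever structure you need downstream.
        return {"id": fields[0], "name": fields[1]}

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromSnowflake" >> ReadFromSnowflake(
                server_name="<account>.snowflakecomputing.com",
                username="<user>",
                password="<password>",
                database="<database>",
                schema="<schema>",
                table="<table>",
                staging_bucket_name="gs://<bucket>",
                storage_integration_name="<storage-integration>",
                csv_mapper=csv_mapper,
            )
            | "Print" >> beam.Map(print)
        )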

Also, it's a cross-language transform, so some additional setup may be needed. It's quite well documented in the pydoc here (pasted below): https://github.com/apache/beam/blob/release-2.24.0/sdks/python/apache_beam/io/external/snowflake.py

Snowflake transforms tested against Flink portable runner.
  **Setup**
  Transforms provided in this module are cross-language transforms
  implemented in the Beam Java SDK. During the pipeline construction, Python SDK
  will connect to a Java expansion service to expand these transforms.
  To facilitate this, a small amount of setup is needed before using these
  transforms in a Beam Python pipeline.
  There are several ways to setup cross-language Snowflake transforms.
  * Option 1: use the default expansion service
  * Option 2: specify a custom expansion service
  See below for details regarding each of these options.
  *Option 1: Use the default expansion service*
  This is the recommended and easiest setup option for using Python Snowflake
  transforms. This option requires the following prerequisites
  before running the Beam pipeline.
  * Install Java runtime in the computer from where the pipeline is constructed
    and make sure that 'java' command is available.
  In this option, Python SDK will either download (for released Beam version) or
  build (when running from a Beam Git clone) an expansion service jar and use
  that to expand transforms. Currently Snowflake transforms use the
  'beam-sdks-java-io-expansion-service' jar for this purpose.
  *Option 2: specify a custom expansion service*
  In this option, you startup your own expansion service and provide that as
  a parameter when using the transforms provided in this module.
  This option requires the following prerequisites before running the Beam
  pipeline.
  * Startup your own expansion service.
  * Update your pipeline to provide the expansion service address when
    initiating Snowflake transforms provided in this module.
  Flink Users can use the built-in Expansion Service of the Flink Runner's
  Job Server. If you start Flink's Job Server, the expansion service will be
  started on port 8097. For a different address, please set the
  expansion_service parameter.
  **More information**
  For more information regarding cross-language transforms see:
  - https://beam.apache.org/roadmap/portability/
  For more information specific to Flink runner see:
  - https://beam.apache.org/documentation/runners/flink/

Snowflake (like most of the portable IOs) has its own Java expansion service, which should be downloaded automatically when you don't specify a custom one. I don't think that should be needed, but I'm mentioning it just to be on the safe side. You can download the jar and start it with java -jar <PATH_TO_JAR> <PORT>, then pass it to snowflake.ReadFromSnowflake as expansion_service='localhost:<PORT>'. Link to the 2.24 version: https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-snowflake-expansion-service/2.24.0
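
For example (the jar file name and port are placeholders, and the connection parameters are the same kind of placeholders as in the read sketch above):

    # In a separate shell, start the expansion service first, e.g.:
    #   java -jar beam-sdks-java-io-snowflake-expansion-service-2.24.0.jar 8097
    from apache_beam.io.external.snowflake import ReadFromSnowflake

    # Construct the transform and apply it in your pipeline as in the sketch above.
    read = ReadFromSnowflake(
        server_name="<account>.snowflakecomputing.com",
        username="<user>",
        password="<password>",
        database="<database>",
        schema="<schema>",
        table="<table>",
        staging_bucket_name="gs://<bucket>",
        storage_integration_name="<storage-integration>",
        csv_mapper=lambda fields: fields,
        # Address of the manually started service instead of the default one.
        expansion_service="localhost:8097",
    )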

Note that it's still experimental, though, so feel free to report issues on the Beam Jira.

Google Cloud Support here!

There's no direct connector from Snowflake to Cloud Dataflow, but one workaround would be what you've mentioned: first dump the output to Cloud Storage, and then connect Cloud Storage to Cloud Dataflow.
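
A rough sketch of that workaround, assuming snowflake-sqlalchemy and google-cloud-storage are installed (all connection details, bucket names, and the query are placeholders):

    import csv
    import io

    import apache_beam as beam
    from google.cloud import storage
    from sqlalchemy import create_engine, text

    # 1) Run the query in Snowflake and dump the result to a Cloud Storage bucket.
    engine = create_engine("snowflake://<user>:<password>@<account>/<database>/<schema>")
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT id, name FROM my_table")).fetchall()

    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    storage.Client().bucket("<bucket>").blob("exports/my_table.csv").upload_from_string(buffer.getvalue())

    # 2) Use the exported file as the input of the Beam pipeline.
    with beam.Pipeline() as p:
        (
            p
            | beam.io.ReadFromText("gs://<bucket>/exports/my_table.csv")
            | beam.Map(lambda line: line.split(","))
        )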

I hope that helps.

For future folks looking for a tutorial on how to get started with Snowflake and Apache Beam, I can recommend the tutorial below, made by the creators of the connector.

https://www.polidea.com/blog/snowflake-and-apache-beam-on-google-dataflow/
