
Can I use google DataFlow with native python?

I'm trying to build a Python ETL pipeline in Google Cloud, and Google Cloud Dataflow seemed like a good option. When I explored the documentation and the developer guides, I saw that Apache Beam is always attached to Dataflow, since Dataflow is based on it. I may run into issues processing my dataframes in Apache Beam.

My questions are:

  • If I want to build my ETL script in native Python, is that possible with Dataflow? Or is it necessary to use Apache Beam for my ETL?
  • Was Dataflow built solely for the purpose of running Apache Beam? Is there any serverless Google Cloud tool for building a Python ETL? (Google Cloud Functions has a 9-minute execution limit, which may cause issues for my pipeline; I want to avoid hitting an execution limit.)

My pipeline aims to read data from BigQuery, process it, and save it back to a BigQuery table. I may use some external APIs inside my script.

Concerning your first question, it looks like Dataflow was primarily written to be used with the Apache Beam SDK, as can be checked in the official Google Cloud documentation on Dataflow. So it is likely a requirement to use Apache Beam for your ETL.

Regarding your second question, this tutorial gives you guidance on how to build your own ETL pipeline with Python and Google Cloud Functions, which are serverless. Could you please confirm whether this link has helped you?

Regarding your first question, Dataflow needs to use Apache Beam. In fact, before Apache Beam there was something called the Dataflow SDK, which was Google-proprietary and was later open-sourced as Apache Beam.

The Python Beam SDK is rather easy once you put a bit of effort into it, and the main processing operations you'd need are very close to native Python.
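To illustrate how close it is: Beam's core transforms mirror Python built-ins, so row-level logic written as plain functions carries over directly. A minimal sketch in pure Python (the record schema with `name` and `score` fields is a made-up example):

```python
# Beam's core transforms closely mirror native Python operations:
#   beam.Map(fn)      ~ map(fn, iterable)
#   beam.Filter(pred) ~ filter(pred, iterable)
# The same row-level functions work in both worlds.

def clean_row(row):
    """Normalize one record (hypothetical schema with 'name' and 'score')."""
    return {"name": row["name"].strip().lower(), "score": float(row["score"])}

def is_valid(row):
    """Keep only rows with a non-negative score."""
    return row["score"] >= 0

rows = [
    {"name": "  Alice ", "score": "10"},
    {"name": "Bob", "score": "-3"},
]

# Plain Python today; in a Beam pipeline this becomes
# (pcoll | beam.Map(clean_row) | beam.Filter(is_valid)).
cleaned = list(filter(is_valid, map(clean_row, rows)))
print(cleaned)  # [{'name': 'alice', 'score': 10.0}]
```

The point is that moving to Beam mostly means wiring the same functions into a pipeline rather than rewriting them.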

If your end goal is to read from, process, and write to BigQuery, I'd say Beam + Dataflow is a good match.
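A pipeline of that shape (BigQuery in, process, BigQuery out) is fairly compact in Beam. A hedged sketch, assuming `apache-beam[gcp]` is installed; the project, dataset, table, and bucket names are all placeholders, and `enrich` stands in for whatever processing (or external API call) you need:

```python
def enrich(row):
    """Row-level transform; an external API call could also go here."""
    out = dict(row)
    out["name_upper"] = row["name"].upper()  # placeholder processing step
    return out

def run():
    # Imported inside the function so enrich() stays importable without Beam.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",             # or "DirectRunner" for local tests
        project="my-project",                # placeholder
        region="us-central1",                # placeholder
        temp_location="gs://my-bucket/tmp",  # placeholder
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromBigQuery(
                query="SELECT name FROM `my-project.my_dataset.src`",
                use_standard_sql=True)
            | "Process" >> beam.Map(enrich)
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.dst",
                schema="name:STRING,name_upper:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
        )
```

Swapping the runner to `DirectRunner` lets you test the same pipeline locally before submitting it to Dataflow, which sidesteps the Cloud Functions time limit entirely since Dataflow jobs can run as long as needed.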


 