
Custom Apache Beam Python version in Dataflow

I am wondering if it is possible to have a custom Apache Beam Python version running in Google Dataflow, that is, a version that is not available in the public repositories (as of this writing: 0.6.0 and 2.0.0). For example, the HEAD version from Apache Beam's official repository, or a specific tag for that matter.

I am aware of the possibility of packing custom packages (private local ones, for example) as described in the official documentation. There are answered questions here on how to do this for some other scripts, and there is even a GIST with guidance on this.
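For reference, such a private local package is typically shipped by passing its sdist tarball via the --extra_package pipeline option. A minimal sketch of that approach (the project, bucket, and package path below are hypothetical placeholders of mine, not taken from the documentation):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Project, bucket and package path are hypothetical placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--staging_location=gs://my-bucket/staging',
    '--temp_location=gs://my-bucket/temp',
    # Ship a locally built private package (sdist tarball) with the job.
    '--extra_package=./dist/my_private_pkg-0.1.tar.gz',
])

p = beam.Pipeline(options=options)
p | 'Create' >> beam.Create([1, 2, 3]) | 'Double' >> beam.Map(lambda x: x * 2)
p.run()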

But I have not managed to get the current Apache Beam development version (or a tagged one), which is available in the master branch of its official repository, packaged and sent along with my script to Google Dataflow. For example, for the latest available tag, whose link for pip to process would be git+https://github.com/apache/beam.git@v2.1.0-RC2#egg=apache_beam[gcp]&subdirectory=sdks/python, I get something like this:

INFO:root:Executing command: ['.../bin/python', '-m', 'pip', 'install', '--download', '/var/folders/nw/m_035l9d7f1dvdbd7rr271tcqkj80c/T/tmpJhCkp8', 'apache-beam==2.1.0', '--no-binary', ':all:', '--no-deps']
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting apache-beam==2.1.0
  Could not find a version that satisfies the requirement apache-beam==2.1.0 (from versions: 0.6.0, 2.0.0)
No matching distribution found for apache-beam==2.1.0

Any ideas? (I am wondering if it is even possible, since Google Dataflow may have pinned the versions of Apache Beam that can run to the officially released ones.)

I will answer myself, as I got the answer to this question in an Apache Beam JIRA issue I have been helping with.

If you want to use a custom Apache Beam Python version in Google Cloud Dataflow (that is, run your pipeline with --runner DataflowRunner), you must use the option --sdk_location <apache_beam_v1.2.3.tar.gz> when you run your pipeline, where <apache_beam_v1.2.3.tar.gz> is the location of the corresponding packaged version that you want to use.

For example, as of this writing, if you have checked out the HEAD version of Apache Beam's git repository, you first have to package the Python SDK by navigating to it with cd beam/sdks/python and then running python setup.py sdist (a compressed tar file will be created in the dist subdirectory).

Thereafter you can run your pipeline like this:

python your_pipeline.py [...your_options...] --sdk_location beam/sdks/python/dist/apache-beam-2.2.0.dev0.tar.gz

Google Cloud Dataflow will use the supplied SDK.
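The same --sdk_location option can also be supplied programmatically when building the pipeline options, in case you prefer not to pass it on the command line. A minimal sketch (project and bucket names are hypothetical placeholders; the SDK tarball path assumes the sdist built above):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Project and bucket are hypothetical placeholders; the tarball path assumes
# the sdist produced by 'python setup.py sdist' in beam/sdks/python above.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--staging_location=gs://my-bucket/staging',
    '--temp_location=gs://my-bucket/temp',
    # Point Dataflow at the custom-built Beam SDK instead of a released one.
    '--sdk_location=beam/sdks/python/dist/apache-beam-2.2.0.dev0.tar.gz',
])

p = beam.Pipeline(options=options)
p | 'Read' >> beam.Create(['a', 'b', 'c']) | 'Upper' >> beam.Map(lambda s: s.upper())
p.run()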

