
Google Cloud Dataflow Dependencies

I want to use Dataflow to process, in parallel, a bunch of video clips I have stored in Google Cloud Storage. My processing algorithm has non-Python dependencies and is expected to change over development iterations.


My preference would be to use a Docker container holding the logic to process the clips, but it appears that custom containers were not supported (in 2017):

use docker for google cloud data flow dependencies

Although they may be supported now, since support was being worked on:

Posthoc connect FFMPEG to opencv-python binary for Google Cloud Dataflow job

According to this issue, a custom Docker image may be pulled, but I couldn't find any documentation on how to do it with Dataflow.

https://issues.apache.org/jira/browse/BEAM-6706?focusedCommentId=16773376&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16773376

Another option might be to use setup.py to install any dependencies as described in this dated example:

https://cloud.google.com/blog/products/gcp/how-to-do-distributed-processing-of-landsat-data-in-python

However, when running the example I get an error that there is no module named osgeo.gdal.

For pure-Python dependencies I have also tried passing the --requirements_file argument; however, I still get an error: Pip install failed for package: -r
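
For reference, here is a minimal sketch of how --requirements_file is normally wired up when launching a pipeline programmatically; the project, bucket, and region values below are placeholders:

```python
# Sketch: passing dependency options programmatically, equivalent to the
# --requirements_file command-line flag. Project/bucket names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                  # placeholder
    region='us-central1',                  # placeholder
    temp_location='gs://my-bucket/tmp',    # placeholder
    requirements_file='requirements.txt',  # pure-Python dependencies only
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['gs://my-bucket/clip1.mp4'])  # placeholder input
     | beam.Map(print))
```

Note that requirements_file should point at a local pip requirements file, with one package per line.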

I could find documentation for adding dependencies to apache_beam, but not to Dataflow, and it appears the apache_beam instructions do not work, based on my tests of --requirements_file and --setup_file.

This was answered in the comments; rewriting it here for clarity:

In Apache Beam you can modify the setup.py file, which will be run once per container on start-up. This file allows you to perform arbitrary commands before the SDK Harness starts to receive commands from the Runner Harness.

A complete example can be found in the Apache Beam repo.
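
As a rough illustration of that pattern (modeled on the juliaset example in the Beam repo; the package name and the apt-get dependency below are placeholders), a setup.py that runs shell commands on each worker at start-up might look like:

```python
# Sketch of a setup.py that runs arbitrary commands when Dataflow workers
# build the package, following the Beam juliaset example's pattern.
import subprocess
from distutils.command.build import build as _build

import setuptools

# Commands run on the worker before the SDK harness starts; the ffmpeg
# install is a placeholder for your real non-Python dependencies.
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'ffmpeg'],
]


class build(_build):
    """Extend the normal build step to also run CustomCommands."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
    """Runs each command in CUSTOM_COMMANDS in a subprocess."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)


setuptools.setup(
    name='clip-processing',  # placeholder package name
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)
```

The pipeline is then launched with --setup_file ./setup.py so that Dataflow ships this file to the workers and runs it there.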

As of 2020, you can use Dataflow Flex Templates, which allow you to specify a custom Docker container in which to execute your pipeline.
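
A minimal sketch of such a container, assuming the pipeline entry point is main.py and building on Google's published Python template launcher base image (the file names and the ffmpeg dependency are illustrative):

```dockerfile
# Sketch of a Flex Template container image; file names are placeholders.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

# Copy the pipeline code and its Python dependency list into the image.
COPY main.py requirements.txt /template/

# Install non-Python dependencies (ffmpeg is a placeholder) and Python ones.
RUN apt-get update \
    && apt-get install --assume-yes ffmpeg \
    && pip install --no-cache-dir -r /template/requirements.txt

# Tell the Flex Template launcher where the pipeline entry point is.
ENV FLEX_TEMPLATE_PYTHON_PY_FILE=/template/main.py
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE=/template/requirements.txt
```

Since the image is a full Docker container, non-Python tooling can be baked in at build time rather than installed on every worker at start-up.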
