Google Cloud Dataflow Dependencies

I want to use Dataflow to process, in parallel, a set of video clips that I have stored in Google Cloud Storage. My processing algorithm has non-Python dependencies and is expected to change over development iterations.


My preference would be to use a Docker container that holds the logic to process the clips, but it appears that custom containers were not supported (as of 2017):

use docker for google cloud data flow dependencies

They may be supported now, though, since the feature was being worked on:

Posthoc connect FFMPEG to opencv-python binary for Google Cloud Dataflow job

According to this issue, a custom Docker image can be pulled, but I couldn't find any documentation on how to do this with Dataflow.

https://issues.apache.org/jira/browse/BEAM-6706?focusedCommentId=16773376&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16773376

Another option might be to use setup.py to install any dependencies as described in this dated example:

https://cloud.google.com/blog/products/gcp/how-to-do-distributed-processing-of-landsat-data-in-python

However, when running the example I get an error that there is no module named osgeo.gdal.

For pure Python dependencies I have also tried passing the --requirements_file argument, but I still get an error: Pip install failed for package: -r

I could find documentation for adding dependencies to apache_beam, but not to Dataflow, and based on my tests of --requirements_file and --setup_file, the apache_beam instructions do not appear to work.
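For reference, both options are passed as ordinary pipeline arguments. The following is only a minimal sketch of assembling them: `build_pipeline_args` is a hypothetical helper, and the project, region, bucket and file names are placeholders, not values from this question. The resulting list is what would typically be handed to `PipelineOptions` in `apache_beam`.

```python
def build_pipeline_args(project, region, temp_location,
                        requirements_file=None, setup_file=None):
    """Assemble command-line style arguments for a Dataflow run.

    Hypothetical helper: it only builds the argument list and does not
    import or call apache_beam itself.
    """
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
    ]
    # Pure-Python dependencies listed in a pip requirements file.
    if requirements_file:
        args.append(f"--requirements_file={requirements_file}")
    # Package with a setup.py, used for anything beyond plain pip installs.
    if setup_file:
        args.append(f"--setup_file={setup_file}")
    return args


if __name__ == "__main__":
    # Placeholder values; replace with your own project and bucket.
    for arg in build_pipeline_args(
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        setup_file="./setup.py",
    ):
        print(arg)
```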

This was answered in the comments; rewriting it here for clarity:

In Apache Beam you can modify the setup.py file, which will be run once per container on start-up. This file allows you to run arbitrary commands before the SDK Harness starts to receive commands from the Runner Harness.

A complete example can be found in the Apache Beam repo.
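As a rough illustration of the setup.py approach, here is a minimal sketch that hooks extra commands into the build step. It is not copied from the Beam example: the package name `video_clip_pipeline`, the helper `run_custom_command` and the contents of `CUSTOM_COMMANDS` are assumptions made for illustration, and the commented-out `apt-get` lines assume a Debian-based worker image.

```python
import subprocess

import setuptools

try:
    # Available in recent setuptools releases.
    from setuptools.command.build import build as _build
except ImportError:
    # Fallback for older environments that still ship distutils.
    from distutils.command.build import build as _build

# Commands to run on each worker before the package itself is built.
# The echo is a harmless placeholder; the apt-get lines are hypothetical.
CUSTOM_COMMANDS = [
    ["echo", "Custom command worked!"],
    # ["apt-get", "update"],
    # ["apt-get", "--assume-yes", "install", "ffmpeg"],
]


def run_custom_command(command):
    """Run one command, raise if it fails, and return its combined output."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    return result.stdout


class build(_build):
    """Standard build step, extended with the CustomCommands sub-command."""

    sub_commands = _build.sub_commands + [("CustomCommands", None)]


class CustomCommands(setuptools.Command):
    """Setuptools command that runs every entry in CUSTOM_COMMANDS."""

    description = "run custom installation commands"
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            print(run_custom_command(command))


if __name__ == "__main__":
    setuptools.setup(
        name="video_clip_pipeline",  # hypothetical package name
        version="0.0.1",
        packages=setuptools.find_packages(),
        cmdclass={"build": build, "CustomCommands": CustomCommands},
    )
```

The file is then passed to the pipeline with --setup_file. Keeping the commands in a plain list makes it easy to change the non-Python dependencies between development iterations without touching the pipeline code.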

As of 2020, you can use Dataflow Flex Templates, which let you specify a custom Docker container in which to execute your pipeline.
