I want to use dataflow to process in parallel a bunch of video clips I have stored in google storage. My processing algorithm has non-python dependencies and is expected to change over development iterations.
My preference would be to use a dockerized container with the logic to process the clips, but it appears that custom containers are not supported (in 2017):
use docker for google cloud data flow dependencies
Although they may be supported now - since it was being worked on:
Posthoc connect FFMPEG to opencv-python binary for Google Cloud Dataflow job
According to this issue a custom docker image may be pulled, but I couldn't find any documentation on how to do it with dataflow.
Another option might be to use setup.py to install any dependencies as described in this dated example:
However, when running the example I get an error that there is no module named osgeo.gdal.
For pure python dependencies I have also tried to pass the --requirements_file
argument, however I still get an error: Pip install failed for package: -r
I could find documentation for adding dependencies to apache_beam, but not to dataflow, and it appears the apache_beam instructions do not work, based on my tests of --requirements_file
and --setup_file
This was answered in the comments, rewriting here for clarity:
In Apache Beam you can modify the setup.py file while will be run once per container on start-up. This file allows you to perform arbitrary commands before the the SDK Harness start to receive commands from the Runner Harness.
A complete example can be found in the Apache Beam repo.
从 2020 年开始,您可以使用Dataflow Flex 模板,它允许您指定一个自定义 Docker 容器来执行您的管道。
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.