tl;dr: One step of my Apache Beam pipeline involves building a Docker image. How can I run this pipeline using Google Dataflow? What alternatives exist?
I'm currently trying to take my first steps with Google's Dataflow service and Apache Beam (Python).
Trivial examples are pretty straightforward, but things get confusing as soon as external software dependencies come into play. It seems to be possible to use custom Docker containers to set up one's own environment [1][2]. While that's great for most dependencies, it doesn't help if the dependency is Docker itself, as happens to be the case for me: one step of my pipeline uses an external project that makes heavy use of Docker (i.e. building images and running them).
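To make the problem concrete, here is a minimal sketch of what that pipeline step boils down to (the function names and the external project's interface are hypothetical; only the Docker CLI call is the point):

```python
import subprocess


def docker_build_cmd(tag: str, context_dir: str) -> list[str]:
    # The Docker CLI invocation the pipeline step needs to issue on the worker.
    return ["docker", "build", "-t", tag, context_dir]


def run_external_step(tag: str, context_dir: str) -> None:
    # Inside a Beam DoFn this would execute on a Dataflow worker -- and fail
    # there unless the worker environment provides a working Docker daemon.
    subprocess.run(docker_build_cmd(tag, context_dir), check=True)
```

The question is therefore not how to install a Python package into the worker environment, but how to get a Docker daemon available to the code running inside it.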
As far as I can tell, there are three options to tackle this problem:

1. Custom VM image for the worker nodes [1]
2. Docker in Docker [3]
3. Don't use Google Dataflow and switch to a better-suited service
Thanks!
[1] Custom VM images for Google Cloud Dataflow workers
[2] https://cloud.google.com/dataflow/docs/guides/using-custom-containers
[3] https://www.docker.com/blog/docker-can-now-run-within-docker/
Custom VM image for worker nodes: Is it possible to use custom VM images for Dataflow worker nodes?
It's not possible to completely replace the Dataflow worker VM image. But you can use a custom Beam SDK Docker container, as you noted. In your case this will result in a Docker-in-Docker style execution.
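For reference, a custom Beam SDK container is supplied via pipeline options. A minimal sketch of the flags involved (project, region, and image names are placeholders; the list would be passed to `apache_beam.options.pipeline_options.PipelineOptions`):

```python
def dataflow_container_args(project: str, region: str, image: str) -> list[str]:
    # Flags for running on Dataflow with a custom Beam SDK container [2].
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        # The custom image should be derived from an Apache Beam SDK base
        # image matching your Beam version; whether a Docker daemon can be
        # made available *inside* it is the Docker-in-Docker question.
        f"--sdk_container_image={image}",
    ]
```

Note that the custom container replaces the SDK harness environment, not the worker VM itself.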
Don't use Google Dataflow: What are better-suited alternative services?
Please see the Beam capability matrix for other Beam runners and their capabilities.
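Since Beam pipelines are runner-portable, switching is largely a matter of pipeline options. For example, a self-managed Flink cluster, where you control the worker hosts and can install Docker on them, could be targeted like this (the master address is a placeholder):

```python
def flink_runner_args(flink_master: str) -> list[str]:
    # Run the same Beam pipeline on a self-managed Flink cluster instead of
    # Dataflow; on your own hosts you decide whether Docker is available.
    return [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
        # LOOPBACK runs user code in the submitting process -- convenient
        # for local testing; omit it for a real cluster deployment.
        "--environment_type=LOOPBACK",
    ]
```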