
Dataflow/Apache Beam: manage custom module dependencies

I have a .py pipeline using Apache Beam that imports another module (.py), which is my custom module. I have a structure like this:

├── mymain.py
└── myothermodule.py

I import myothermodule in mymain.py like this:

import myothermodule

When I run locally with DirectRunner, I have no problem. But when I run it on Dataflow with DataflowRunner, I get an error that says:

ImportError: No module named myothermodule

So what should I do so that this module is found when running the job on Dataflow?

When you run your pipeline remotely, you need to make any dependencies available on the remote workers as well. To do this, turn your module into a Python package by placing it in a directory with an __init__.py file and adding a setup.py. The structure would look like this:

├── mymain.py
├── setup.py
└── othermodules
    ├── __init__.py
    └── myothermodule.py

And import it like this:

from othermodules import myothermodule

Then you can run your pipeline with the command-line option --setup_file ./setup.py.
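For example, the full invocation could look something like this (the project, region and bucket values below are placeholders for your own Dataflow setup):

python mymain.py \
    --runner DataflowRunner \
    --project YOUR_PROJECT \
    --region us-central1 \
    --temp_location gs://YOUR_BUCKET/tmp \
    --setup_file ./setup.py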

A minimal setup.py file would look like this:

import setuptools

# Automatically include every package (any directory with an __init__.py)
# found next to setup.py, e.g. the othermodules package above.
setuptools.setup(packages=setuptools.find_packages())
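If you prefer not to pass the flag on the command line, the same option can also be set programmatically through the pipeline options. Here is a rough sketch assuming the same placeholder project/bucket values as above; some_function is a hypothetical function standing in for whatever your module actually provides:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

from othermodules import myothermodule

# Regular Dataflow options (placeholder values).
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=YOUR_PROJECT',
    '--region=us-central1',
    '--temp_location=gs://YOUR_BUCKET/tmp',
])
# Tell the remote workers to install the local package described by setup.py.
options.view_as(SetupOptions).setup_file = './setup.py'

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['a', 'b'])
     | beam.Map(myothermodule.some_function))  # hypothetical function from your module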

The whole setup is documented here.

And a complete example using this can be found here.
