
Python apache beam ImportError: No module named *** on dataflow worker

Summary: some local packages work and some don't.

My beam application's structure:

-setup.py

-app/__init__.py
-app/main.py

-package1/__init__.py
-package1/one.py

-package2/__init__.py
-package2/two.py

-package3/__init__.py
-package3/three.py

In main.py:

from package1 import one
from package2 import two
from package3 import three

In setup.py

import setuptools

setuptools.setup(
    name='beam',
    version='1.0',
    install_requires=['apache-beam[gcp]',
                      'google-cloud==0.34.0',
                      'google-cloud-bigquery==0.25.0',
                      'requests==2.19.1',
                      'google-cloud-storage==1.12.0'
                      ],
    packages=setuptools.find_packages(),
)
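A quick sanity check on the packaging side is to verify that all three packages are actually discovered from the project root. The sketch below mimics, in stdlib terms (not the real setuptools implementation), the rule find_packages() applies: a directory counts as a package only if it contains an __init__.py:

```python
import os
import tempfile

def find_local_packages(root):
    """Roughly mimic setuptools.find_packages(): report every
    directory under root that contains an __init__.py."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "__init__.py" in filenames:
            rel = os.path.relpath(dirpath, root)
            found.append(rel.replace(os.sep, "."))
    return sorted(found)

# Demo: recreate the question's layout in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for pkg in ("app", "package1", "package2", "package3"):
        os.makedirs(os.path.join(root, pkg))
        open(os.path.join(root, pkg, "__init__.py"), "w").close()
    print(find_local_packages(root))  # all four packages show up
```

If package3 were missing from this list, the worker error would be expected; the puzzle in the question is that it is listed and still fails to import on the worker.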

When running with python -m app.main:

With the direct runner (running locally), there is no problem.

With the DataflowRunner (submitting to Google Dataflow), I get this error:

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 156, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 344, in apache_beam.runners.worker.operations.DoOperation.start
    def start(self):
  File "apache_beam/runners/worker/operations.py", line 345, in apache_beam.runners.worker.operations.DoOperation.start
    with self.scoped_start_state:
  File "apache_beam/runners/worker/operations.py", line 350, in apache_beam.runners.worker.operations.DoOperation.start
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 244, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 316, in loads
    return load(file, ignore)
  File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 304, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 465, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named three
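The traceback shows the failure happening inside pickle's find_class: Beam pickles the pipeline's functions by reference, and the worker re-imports the defining module when unpickling, so that module must be importable on the worker. A minimal stdlib illustration of that mechanism (using plain pickle rather than Beam's dill, with a fabricated module named three):

```python
import pickle
import sys
import types

# Fabricate a module named "three" with a function in it, standing in
# for package3/three.py on the machine that submits the pipeline.
mod = types.ModuleType("three")
exec("def run():\n    return 3", mod.__dict__)
sys.modules["three"] = mod

# Functions pickle *by reference*: the payload stores only the names
# "three" and "run", not the function's code.
payload = pickle.dumps(mod.run)

# While the module is importable, unpickling works.
assert pickle.loads(payload)() == 3

# Simulate a worker that doesn't have the module: unpickling now has
# to call __import__("three"), which fails just like in the traceback.
del sys.modules["three"]
try:
    pickle.loads(payload)
except ImportError as exc:
    print("unpickling failed:", exc)
```

This is why the error depends on what the worker has installed, not on anything being wrong with the module's source.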

This is "a bit" frustrating because I have double/triple-checked what the difference between those packages could be, and they are identical: the same __init__.py files (empty, no weird or hidden characters in them), the same type of structure in the *.py files. But for some reason, package3 just doesn't want to cooperate.
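For context, Dataflow workers only install local packages if the pipeline is explicitly pointed at setup.py via the setup_file pipeline option; a typical submission looks something like this (the project, region, and bucket names are placeholders):

```shell
python -m app.main \
    --runner DataflowRunner \
    --project my-project \
    --region us-central1 \
    --temp_location gs://my-bucket/tmp \
    --setup_file ./setup.py
```

Without --setup_file (or an equivalent such as --extra_package), the workers never receive package1/package2/package3 at all.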

Does anyone have a solution for this problem?

Thank you.

It's been almost a year, but I had a very similar issue and was able to resolve it, so I'm posting for others who stumble onto this page.

In my case, there was nothing special about package3.three; it just happens to be the first module the worker tries to import. In fact, removing package3.three (e.g. by temporarily inlining its contents into main.py) leads to the same error with one of the other modules.

While I don't fully understand the root cause, running with a file invocation (python app/main.py) rather than the module invocation (python -m app.main) resolved the issue. I'm guessing there is some conflict between the packaging in setup.py and the implicit packaging of module invocation.
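The difference between the two invocations shows up in how they seed sys.path: a file invocation puts the script's own directory first, while -m puts the current working directory first, so the same import statements resolve against different roots. A self-contained sketch (the throwaway project layout is fabricated for the demo):

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    app = os.path.join(root, "app")
    os.makedirs(app)
    open(os.path.join(app, "__init__.py"), "w").close()
    with open(os.path.join(app, "main.py"), "w") as f:
        f.write("import sys; print(sys.path[0])\n")

    # File invocation: sys.path[0] is the directory of the script itself.
    file_out = subprocess.run(
        [sys.executable, os.path.join(app, "main.py")],
        capture_output=True, text=True).stdout.strip()

    # Module invocation: sys.path[0] reflects the current working
    # directory instead (the project root here).
    mod_out = subprocess.run(
        [sys.executable, "-m", "app.main"],
        capture_output=True, text=True, cwd=root).stdout.strip()

print("file invocation:  ", file_out)
print("module invocation:", mod_out)
```

Under the module invocation, top-level packages resolve relative to wherever you launched from, which is one plausible way the packaging in setup.py and the import roots on the worker can end up disagreeing.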
