Google Cloud Dataflow: ModuleNotFoundError: No module named 'main' when running integration test
I have an Apache Beam pipeline that works fine in both local and cloud modes. However, I also have an end-to-end integration test (IT) that I run on every MR, and the IT is submitted to Dataflow.

This time, the IT is throwing the following error:
```
... in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'main'
```
The stack trace doesn't point at all to the place where the module fails to import. It shows just the following:
```
job-v2-test-20-08160911-vs73-harness-drt8
Root cause: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'main'
```
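For context, the traceback comes from dill, the pickler Beam uses to ship functions to workers. A function is commonly pickled *by reference* to its defining module, so the worker must be able to import that module under the same name. A minimal stdlib-only reproduction of the failure mode, using plain `pickle` and a synthetic module standing in for `main.py` (hypothetical names, not the actual pipeline code):

```python
import pickle
import sys
import types

# Simulate a pipeline entry point that was imported as a module named 'main'.
main_mod = types.ModuleType('main')
exec(
    "def is_dropped(rec):\n"
    "    return rec['existing_status'] == 'Dropped'\n",
    main_mod.__dict__,
)
sys.modules['main'] = main_mod

# The function is pickled by reference: the payload stores only
# "module 'main', attribute 'is_dropped'", not the function body.
payload = pickle.dumps(main_mod.is_dropped)

# A Dataflow worker has no module called 'main' on its path, so the
# unpickler's __import__('main') fails, just like in the traceback above.
del sys.modules['main']
err = None
try:
    pickle.loads(payload)
except ModuleNotFoundError as exc:
    err = str(exc)
print(err)  # No module named 'main'
```

The same mechanism applies when dill pickles references into the submitting process's session: whatever module name the callable was defined under must be importable on the worker.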
I use the main module only in the IT file; it doesn't appear in any transform of the pipeline. Also, when I run the IT, half of the pipeline's transforms run successfully until it hangs with the error above.
The IT code:
```python
# Imports reconstructed; adjust paths to your project.
from datetime import datetime
import argparse
import logging
import unittest

from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.runner import PipelineState
from apache_beam.testing.pipeline_verifiers import PipelineStateMatcher
from apache_beam.testing.test_pipeline import TestPipeline
from hamcrest import all_of

# Project-specific helpers/constants defined elsewhere in the test suite:
# get_bq_instance, IT_BUCKET, IT_DATASET, IT_OUTPUT
from main import run


class PipelineIT(unittest.TestCase):

    def setUp(self):
        self.test_pipeline = TestPipeline(is_integration_test=True)
        parser = argparse.ArgumentParser()
        self.args, self.beam_args = parser.parse_known_args()
        self.pipeline_options = PipelineOptions(self.beam_args)
        self.client = get_bq_instance()
        self.tables_timestamp = datetime.now().strftime("%Y%m%d%H%M")

    def test_mc_end_to_end(self):
        state_verifier = PipelineStateMatcher(PipelineState.DONE)
        extra_opts = {
            'input': IT_BUCKET,
            'output_dataset': IT_DATASET,
            'output': IT_OUTPUT,
            'bq_timestamp': self.tables_timestamp,
            'on_success_matcher': all_of(state_verifier),
        }
        run(self.test_pipeline.get_full_options_as_args(**extra_opts),
            save_main_session=True)
        # bunch of asserts
```
The command I'm using to run the IT:

```shell
coverage run -m pytest --log-cli-level=INFO integration_tests/end_to_end_it_test.py --job_name "end_to_end_it" --test-pipeline-options=" --run_mode=cloud --mode=test --setup_file=path_to_setup.py"
```
The pipeline works fine in production mode, but in testing mode it throws that error. I'm just wondering: if `main` is used only to trigger the integration test locally, how can it break the pipeline with this error?
After deep investigation: in my pipeline I was using `beam.Filter` in the following way:

```python
dropped_and_missing = all_recs | 'Filter Dropped and Missing recs' >> beam.Filter(
    lambda rec: rec['existing_status'] == 'Dropped' or rec['existing_status'] == 'Missing')
```
Replacing that code block with a `PTransform` based on `if` conditions solved the issue.
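For illustration, one shape such a replacement can take: move the predicate out of the inline lambda into a named, module-level function with explicit `if` conditions, so the pickled reference points at the pipeline's own module rather than the session that imported `main`. This is a hypothetical sketch, not the poster's actual fix:

```python
def is_dropped_or_missing(rec):
    # Explicit if-conditions instead of an inline lambda. Because this
    # function lives at module level in the pipeline's own module,
    # workers can re-import it by name.
    if rec['existing_status'] == 'Dropped':
        return True
    if rec['existing_status'] == 'Missing':
        return True
    return False

# Stand-alone check of the predicate on plain dicts:
recs = [
    {'existing_status': 'Dropped'},
    {'existing_status': 'Active'},
    {'existing_status': 'Missing'},
]
kept = [rec for rec in recs if is_dropped_or_missing(rec)]
print([rec['existing_status'] for rec in kept])  # ['Dropped', 'Missing']
```

In the pipeline this would replace the lambda, e.g. `all_recs | 'Filter Dropped and Missing recs' >> beam.Filter(is_dropped_or_missing)`, or be wrapped in a small `PTransform` as described above.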
I don't know where the issue is. I tried to dig into the source code, checking whether there is any reference to a main module in the `Filter` function, but there isn't one.
Also suspicious: the error occurs only when running the integration test from the command line. The pipeline works fine with `LocalRunner` and `DataflowRunner` otherwise.
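As a rough diagnostic (my own suggestion, not from the original post): before submitting a job, you can check which module each callable in the pipeline would be resolved against when pickled by reference; anything reporting `main` or `__main__` must be importable under that exact name on the workers, or be pickled by value instead:

```python
def pickled_reference(fn):
    # A by-reference unpickler effectively does __import__(fn.__module__)
    # and then looks up the qualified name, so 'main' or '__main__' here
    # is a red flag for a Dataflow job. This is a heuristic: dill can
    # also pickle lambdas by value, where the rule differs.
    return f"{fn.__module__}.{fn.__qualname__}"

def is_dropped(rec):  # hypothetical predicate for demonstration
    return rec['existing_status'] == 'Dropped'

print(pickled_reference(is_dropped))
```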