Google Cloud Dataflow: ModuleNotFoundError: No module named 'main' when running integration test
I have an Apache Beam pipeline that works fine in both local and cloud modes. However, I also have an end-to-end integration test (IT) that I run on every MR, and the IT is submitted to Dataflow.

This time, the IT is throwing the following error:
```
... in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'main'
```
The stack trace doesn't point at all to the place where the module fails to import. It shows just the following:
```
job-v2-test-20-08160911-vs73-harness-drt8
Root cause: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'main'
```
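For context, the traceback comes from dill, the pickler Beam uses to ship functions to workers. A function is commonly pickled *by reference* to its defining module, so the worker must be able to import that module under the same name. A minimal stdlib-only reproduction of the failure mode, using plain `pickle` and a synthetic module standing in for `main.py` (hypothetical names, not the actual pipeline code):

```python
import pickle
import sys
import types

# Simulate a pipeline entry point that was imported as a module named 'main'.
main_mod = types.ModuleType('main')
exec(
    "def is_dropped(rec):\n"
    "    return rec['existing_status'] == 'Dropped'\n",
    main_mod.__dict__,
)
sys.modules['main'] = main_mod

# The function is pickled by reference: the payload stores only
# "module 'main', attribute 'is_dropped'", not the function body.
payload = pickle.dumps(main_mod.is_dropped)

# A Dataflow worker has no module called 'main' on its path, so the
# unpickler's __import__('main') fails, just like in the traceback above.
del sys.modules['main']
err = None
try:
    pickle.loads(payload)
except ModuleNotFoundError as exc:
    err = str(exc)
print(err)  # No module named 'main'
```

The same mechanism applies when dill pickles references into the submitting process's session: whatever module name the callable was defined under must be importable on the worker.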
I use the main module only in the IT file; it doesn't appear in any transform of the pipeline. Also, when I run the IT, half of the pipeline's transforms run successfully until it hangs with the error above.
The IT code:
```python
# Imports reconstructed; adjust paths to your project.
from datetime import datetime
import argparse
import logging
import unittest

from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.runner import PipelineState
from apache_beam.testing.pipeline_verifiers import PipelineStateMatcher
from apache_beam.testing.test_pipeline import TestPipeline
from hamcrest import all_of

# Project-specific helpers/constants defined elsewhere in the test suite:
# get_bq_instance, IT_BUCKET, IT_DATASET, IT_OUTPUT
from main import run


class PipelineIT(unittest.TestCase):

    def setUp(self):
        self.test_pipeline = TestPipeline(is_integration_test=True)
        parser = argparse.ArgumentParser()
        self.args, self.beam_args = parser.parse_known_args()
        self.pipeline_options = PipelineOptions(self.beam_args)
        self.client = get_bq_instance()
        self.tables_timestamp = datetime.now().strftime("%Y%m%d%H%M")

    def test_mc_end_to_end(self):
        state_verifier = PipelineStateMatcher(PipelineState.DONE)
        extra_opts = {
            'input': IT_BUCKET,
            'output_dataset': IT_DATASET,
            'output': IT_OUTPUT,
            'bq_timestamp': self.tables_timestamp,
            'on_success_matcher': all_of(state_verifier),
        }
        run(self.test_pipeline.get_full_options_as_args(**extra_opts),
            save_main_session=True)
        # bunch of asserts
```
The command I'm using to run the IT:

```shell
coverage run -m pytest --log-cli-level=INFO integration_tests/end_to_end_it_test.py --job_name "end_to_end_it" --test-pipeline-options=" --run_mode=cloud --mode=test --setup_file=path_to_setup.py"
```
The pipeline works fine in production mode, but in testing mode it throws that error. I'm just wondering: if `main` is used only to trigger the integration test locally, how can it break the pipeline with this error?
After deep investigation: in my pipeline I was using `beam.Filter` in the following way:

```python
dropped_and_missing = all_recs | 'Filter Dropped and Missing recs' >> beam.Filter(
    lambda rec: rec['existing_status'] == 'Dropped' or rec['existing_status'] == 'Missing')
```
Replacing that code block with a `PTransform` based on `if` conditions solved the issue.
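For illustration, one shape such a replacement can take: move the predicate out of the inline lambda into a named, module-level function with explicit `if` conditions, so the pickled reference points at the pipeline's own module rather than the session that imported `main`. This is a hypothetical sketch, not the poster's actual fix:

```python
def is_dropped_or_missing(rec):
    # Explicit if-conditions instead of an inline lambda. Because this
    # function lives at module level in the pipeline's own module,
    # workers can re-import it by name.
    if rec['existing_status'] == 'Dropped':
        return True
    if rec['existing_status'] == 'Missing':
        return True
    return False

# Stand-alone check of the predicate on plain dicts:
recs = [
    {'existing_status': 'Dropped'},
    {'existing_status': 'Active'},
    {'existing_status': 'Missing'},
]
kept = [rec for rec in recs if is_dropped_or_missing(rec)]
print([rec['existing_status'] for rec in kept])  # ['Dropped', 'Missing']
```

In the pipeline this would replace the lambda, e.g. `all_recs | 'Filter Dropped and Missing recs' >> beam.Filter(is_dropped_or_missing)`, or be wrapped in a small `PTransform` as described above.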
I don't know where the issue is. I tried to dig into the source code, checking whether there is any reference to a main module in the `Filter` function, but there isn't one.
Also suspicious: the error occurs only when running the integration test from the command line. The pipeline works fine with `LocalRunner` and `DataflowRunner` otherwise.
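As a rough diagnostic (my own suggestion, not from the original post): before submitting a job, you can check which module each callable in the pipeline would be resolved against when pickled by reference; anything reporting `main` or `__main__` must be importable under that exact name on the workers, or be pickled by value instead:

```python
def pickled_reference(fn):
    # A by-reference unpickler effectively does __import__(fn.__module__)
    # and then looks up the qualified name, so 'main' or '__main__' here
    # is a red flag for a Dataflow job. This is a heuristic: dill can
    # also pickle lambdas by value, where the rule differs.
    return f"{fn.__module__}.{fn.__qualname__}"

def is_dropped(rec):  # hypothetical predicate for demonstration
    return rec['existing_status'] == 'Dropped'

print(pickled_reference(is_dropped))
```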