
Encounter AssertionError: Job did not reach to a terminal state after waiting indefinitely with Beam/Dataflow

I am trying to use Apache Beam and Dataflow to speed up some data processing, but it runs into the following error:

'Job did not reach to a terminal state after waiting indefinitely.') AssertionError: Job did not reach to a terminal state after waiting indefinitely.

I have simplified my pipeline for testing, but I am still getting the error (although I can run it successfully locally with DirectRunner), so I figure it should be either some naive setup issue or a bug in Beam/Dataflow. Also, I looked it up and there is another issue that gives a similar error, caused by reading a large amount of data from Google Storage, which is likely already fixed. I don't think my case relates to that, as my minimal code still does not pass the test. Below is my minimal code (the long argparse section is kept since I suspect the error might relate to it):

import os
import argparse
import apache_beam as beam
import logging

def run(argv=None, save_main_session=True) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument('--given_landmarks', default=False, type=bool,
                        help="Whether to use pre-selected landmark objects")
    parser.add_argument('--hmm_type', default='path_specific', type=str, choices=['path_specific', 'hard_em',  'random'],
                        help='The HMM type. Currently Path-specific, Hard EM, and Random are available.')
    parser.add_argument('--magnitude_normalization', default='normal', type=str,choices=['gamma', 'normal'],
                        help="Distribution type for calculating probability of magnitude for Observer.")
    parser.add_argument('--instruction_type', default='full', type=str,
                        choices=['full', 'object_only', 'direction_only',
                                'mask_object', 'mask_direction'],
                        help='Toggle for full/object-only/direction-only instructions.')
    parser.add_argument('--num_instructions', default=1, type=int,
                        help="The number of instructions to generate per path")
    parser.add_argument('--mp3d_dir', default='/path/to/matterport_data/', type=str,
                        help='Path to Room-to-Room scan data.')
    parser.add_argument('--path_input_dir', default=None, type=str,
                        help='Path to Room-to-Room JSON data.')

    parser.add_argument('--dataset', default=None, type=str, choices=[
                        'R2R', 'R4R', 'RxR'], help='Data source.')

    parser.add_argument(
        '--file_identifier', default='val_seen', type=str,
        help='Source JSON file identifier for Crafty instruction creation.')

    parser.add_argument('--output_file', default=None, type=str,
                    help='Output file to save generated instructions.')

    parser.add_argument(
        '--appraiser_file', type=str,
        default='./crafty.object_idfs.r2r_train.txt',
        help='File to read appraiser information from.')

    parser.add_argument(
        '--full_train_file_path', default=None, type=str,
        help='Path to full training file, for EM training covering all partitions.')

    args, pipeline_args = parser.parse_known_args()
    print(args)
    if not os.path.exists(args.output_file):
        os.makedirs(args.output_file)

    def pipeline(root):

        logging.info('Starting Beam pipeline.')
        outputs = (
        root
        | 'create_input_1' >> beam.Create([1,2,3,4,5])
        | 'map' >> beam.Map(lambda x: (x, 1))
        )
        outputs | beam.Map(print)

    pipeline_options = beam.options.pipeline_options.PipelineOptions(pipeline_args)
    # pipeline_options = beam.options.pipeline_options.PipelineOptions()
    # pipeline_options.view_as(beam.options.pipeline_options.SetupOptions).save_main_session = save_main_session
    # pipeline_options.view_as(beam.options.pipeline_options.DirectOptions).direct_num_workers = os.cpu_count()
    # pipeline_options.view_as(beam.options.pipeline_options.DirectOptions).direct_running_mode = "multi-processing"

    with beam.Pipeline(options=pipeline_options) as root:
        pipeline(root)


if __name__ == '__main__':
    run()

And here is the command I run:

 python test.py \
    --path_input_dir gs://somepath \
    --dataset somename  \
    --mp3d_dir gs://somepath  \
    --file_identifier someid  \
    --output_file gs://some/other/path  \
    --num_instructions 1 \
    --region us-east1 \
    --runner DataflowRunner \
    --project someproject-id \
    --temp_location gs://someloc

Thanks for any comments or suggestions!

Not a perfect answer, but this error message indicates that the thread watching and waiting for your job to finish was terminated before the job completed, even though you did not specify a maximum time to wait. It could have died for a variety of reasons.

The error occurs here in the Beam codebase, for reference.
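
If you want to dig into it yourself, here is a minimal sketch (my own code, not from the question; run_bounded and the 30-minute timeout are arbitrary choices, and pipeline_options is assumed to be built from pipeline_args as in the question) that runs the same toy pipeline without the with-block, bounds the wait, and logs the final job state. As far as I can tell from that assertion, it only fires when wait_until_finish is called with no duration and the polling thread exits before the job reaches a terminal state, so bounding the wait at least gets you past the crash and lets you inspect result.state and the job logs in the Dataflow console.

import logging

import apache_beam as beam


def run_bounded(pipeline_options):
    # Same toy pipeline as in the question, but keep a handle on the
    # PipelineResult instead of letting the with-block wait forever.
    p = beam.Pipeline(options=pipeline_options)
    _ = (
        p
        | 'create_input_1' >> beam.Create([1, 2, 3, 4, 5])
        | 'map' >> beam.Map(lambda x: (x, 1))
        | 'print' >> beam.Map(print)
    )
    result = p.run()
    # duration is in milliseconds; a bounded wait avoids the
    # "waiting indefinitely" assertion in the Dataflow runner.
    result.wait_until_finish(duration=30 * 60 * 1000)
    logging.info('Job state after waiting: %s', result.state)
    return result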

Did you check the logs? It may be a permissions issue. I received the same error, and in the job logs I had this message:

Workflow failed. Causes: Permissions verification for controller service account failed. All permissions in IAM role roles/dataflow.worker should be granted to controller service account XXXXXXXXXXXXX-compute@developer.gserviceaccount.com.
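
If that turns out to be the cause in your case as well, granting the Dataflow worker role to the controller service account should clear it. A sketch of the gcloud command (substitute your own project ID and the service account named in the error message):

 gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:XXXXXXXXXXXXX-compute@developer.gserviceaccount.com" \
    --role="roles/dataflow.worker"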
