
Dataflow job retains old error state after updating

When I submitted my Dataflow job using DataflowRunner (a streaming job with a Pub/Sub source), I made a mistake when defining the execution parameter for the BQ table name (let's say the wrong table name is project-A), and the job threw an error. I then updated the job with the --update command and the correct table name, but the job threw an error again, i.e. it told me that I was still using project-A as the BQ table name.

In short, this is what I did:

  1. I submitted a Dataflow job:
python main.py \
 --job_name=dataflow-job1 \
 --runner=DataflowRunner \
 --staging_location=gs://project-B-bucket/staging \
 --temp_location=gs://project-B-bucket/temp \
 --dataset=project-A:table-A
  2. I got an error because project-A:table-A was not the correct dataset:
{
  "error": {
    "code": 403,
    "message": "Access Denied: Dataset project-A:table-A: User does not have bigquery.datasets.get permission for dataset project-A:table-A.",
    "errors": [
      {
        "message": "Access Denied: Dataset project-A:table-A: User does not have bigquery.datasets.get permission for dataset project-A:table-A.",
        "domain": "global",
        "reason": "accessDenied"
      }
    ],
    "status": "PERMISSION_DENIED"
  }
}
  3. I updated the job using --update:
python main.py \
 --job_name=dataflow-job1 \
 --runner=DataflowRunner \
 --staging_location=gs://project-B-bucket/staging \
 --temp_location=gs://project-B-bucket/temp \
 --dataset=project-B:table-B \
 --update
  4. Then I got the same error as before (point 2).

Why does it seem like the old state of the job is still retained? I thought that if Dataflow detected an error in the job, it would not process the pipeline, the Pub/Sub messages would not be ACKed, and the pipeline would be restarted.

Update 2020-12-08: This is how I pass the parameter arguments:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Custom execution parameter for the target BigQuery dataset
        parser.add_argument('--dataset')

...

class WriteToBigQuery(beam.PTransform):
    def __init__(self, name):
        self.name = name

    def expand(self, pcoll):
        return (pcoll
                | 'WriteBQ' >> beam.io.WriteToBigQuery(
                    '{0}.my_table'.format(self.name),
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


def run(argv=None, save_main_session=True):
    pipeline_options = PipelineOptions(flags=argv)
    pipeline_options.view_as(StandardOptions).streaming = True
    my_args = pipeline_options.view_as(MyOptions)
    ...
    with beam.Pipeline(options=pipeline_options) as p:
        ...
        # I wrapped the BQ write component inside a PTransform class;
        # my_args.dataset is resolved here, at graph-construction time
        output | 'WriteBQ' >> WriteToBigQuery(my_args.dataset)

You can't change the pipeline parameters when you update a Dataflow streaming job. You can only update the transforms of your pipeline.
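Since --update cannot change the pipeline parameters, the practical fix is to drain (or cancel) the misconfigured job and submit a brand-new one with the corrected --dataset, without --update. Below is a minimal sketch of that workaround, reusing the submit command from the question; the region us-central1 and JOB_ID are placeholders for your actual values.

# Find and drain the misconfigured streaming job
gcloud dataflow jobs list --region=us-central1 --status=active
gcloud dataflow jobs drain JOB_ID --region=us-central1

# Resubmit as a new job with the corrected dataset (no --update)
python main.py \
 --job_name=dataflow-job1 \
 --runner=DataflowRunner \
 --staging_location=gs://project-B-bucket/staging \
 --temp_location=gs://project-B-bucket/temp \
 --dataset=project-B:table-B

Draining lets in-flight Pub/Sub messages finish processing before the job stops; cancelling is faster but may drop data that has already been read.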
