
Dataflow Job is failing within 40 Seconds

I have a simple Google Cloud HTTP trigger function that is responsible for triggering a Dataflow runner job, which loads data from a CSV file on Cloud Storage into a BigQuery table.

My code is given below:

import apache_beam as beam
import argparse
from apache_beam.options.pipeline_options import SetupOptions, PipelineOptions

PROJECT = 'proj'
BUCKET='BUCKET'
SCHEMA = 'sr:INTEGER,abv:FLOAT,id:INTEGER,name:STRING,style:STRING,ounces:FLOAT,ibu:STRING,brewery_id:STRING'
DATAFLOW_JOB_NAME = 'jobname'


def execute(request):
    argv = [
      '--project={0}'.format(PROJECT),
      '--job_name={0}'.format(DATAFLOW_JOB_NAME),
      '--staging_location=gs://{0}/staging/'.format(BUCKET),
      '--temp_location=gs://{0}/staging/'.format(BUCKET),
      '--region=europe-west2',
      '--runner=DataflowRunner'
    ]

    #p = beam.Pipeline(argv=argv)
    pipeline_options = PipelineOptions(argv)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    p = beam.Pipeline(options=pipeline_options)
    input = 'gs://{0}/beers.csv'.format(BUCKET)
    print('step-222')

    (p | 'ReadData' >> beam.io.ReadFromText(input, skip_header_lines=1)
       | 'SplitData' >> beam.Map(lambda x: x.split(','))
       | 'FormatToDict' >> beam.Map(lambda x: {"sr": x[0], "abv": x[1], "ibu": x[2], "id": x[3], "name": x[4], "style": x[5], "brewery_id": x[6], "ounces": x[7]}) 
       | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
           table='data',
           dataset='sandbox',
           project=PROJECT,
           schema=SCHEMA,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
           ))
    p.run()
    return "success"

The function runs successfully and it also creates a Dataflow job, but the Dataflow job fails within 40 seconds without creating the graph view, reporting an error (shown in a screenshot).

As @captainnabla said in his comment, you have to create a subnetwork and pass it as an option to your Dataflow job.

  • Solution 1

In the default VPC of the project, create a subnetwork for Dataflow.
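A minimal sketch of creating such a subnetwork with the `gcloud` CLI; the subnetwork name `dataflow-subnet` and the CIDR range are example values, and the region matches the `europe-west2` used by the job in the question:

```shell
# Create a subnetwork for Dataflow in the project's default VPC,
# in the same region as the Dataflow job.
gcloud compute networks subnets create dataflow-subnet \
    --network=default \
    --region=europe-west2 \
    --range=10.10.0.0/24
```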

If you don't specify a subnetwork, the project's default VPC network is usually used by the Dataflow job. I don't know why this didn't work in your case (maybe the default network picked up by the job is outside of the project executing the job).

  • Solution 2

Create another VPC for your data pipelines, with a subnetwork for Dataflow.

The network configuration depends on your team's strategy.

In both solutions, you can pass the subnetwork as a program argument to your Dataflow job:

--subnetwork=https://www.googleapis.com/compute/v1/projects/{PROJECT_ID}/regions/{REGION}/subnetworks/{SUBNETWORK}
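Applied to the question's code, this would mean adding the flag to the `argv` list that is passed to `PipelineOptions`. A sketch, where `dataflow-subnet` is a hypothetical subnetwork name and the other values come from the question:

```python
# Building the pipeline argument list with an explicit subnetwork.
# PROJECT and BUCKET mirror the question's placeholders; SUBNETWORK
# ("dataflow-subnet") is a hypothetical name you would create first.
PROJECT = 'proj'
BUCKET = 'BUCKET'
REGION = 'europe-west2'
SUBNETWORK = 'dataflow-subnet'

argv = [
    '--project={0}'.format(PROJECT),
    '--staging_location=gs://{0}/staging/'.format(BUCKET),
    '--temp_location=gs://{0}/staging/'.format(BUCKET),
    '--region={0}'.format(REGION),
    '--runner=DataflowRunner',
    # Full subnetwork URL, as expected by the Dataflow service:
    '--subnetwork=https://www.googleapis.com/compute/v1/'
    'projects/{0}/regions/{1}/subnetworks/{2}'.format(
        PROJECT, REGION, SUBNETWORK),
]
print(argv[-1])
```

The rest of the pipeline construction (`PipelineOptions(argv)`, `beam.Pipeline(options=...)`) stays the same as in the question.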
