
Vertex Workbench - how to run BigQueryExampleGen in a Jupyter notebook

Problem

Tried to run BigQueryExampleGen in a Jupyter notebook on Vertex AI Workbench, and got the error below.

InvalidUserInputError: Request missing required parameter projectId [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

Steps

Set up the GCP project and the interactive TFX context.

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "path_to_credential_file"


from tfx.v1.extensions.google_cloud_big_query import BigQueryExampleGen
from tfx.v1.components import (
    StatisticsGen,
    SchemaGen,
)
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
%load_ext tfx.orchestration.experimental.interactive.notebook_extensions.skip
context = InteractiveContext(pipeline_root='./data/artifacts')
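If GOOGLE_APPLICATION_CREDENTIALS points at a missing file, the failure only surfaces later as an opaque authentication error deep inside the Beam pipeline. A minimal sketch of an early check (use_service_account is a hypothetical helper, not part of TFX):

```python
import os

def use_service_account(key_path: str) -> str:
    """Point Google client libraries at a service-account key file,
    failing fast if the file does not exist."""
    if not os.path.isfile(key_path):
        raise FileNotFoundError(f"credential file not found: {key_path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
    return key_path
```

This just front-loads a check that the client libraries would otherwise defer until the first API call.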

Run the BigQueryExampleGen.

query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query)
)

Got the error.

InvalidUserInputError: Request missing required parameter projectId [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

Data

See mlops-with-vertex-ai/01-dataset-management.ipynb to set up the BigQuery dataset for the Chicago Taxi Trips dataset.
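Before wiring the query into the pipeline, it can help to confirm the prepared table actually exists. A minimal sketch, assuming the google-cloud-bigquery client library is installed and credentials are configured (table_ref and check_table are illustrative helpers, not part of the linked notebook):

```python
def table_ref(project_id: str) -> str:
    """Fully qualified ID of the prepared Chicago Taxi Trips table."""
    return f"{project_id}.public_dataset.chicago_taxitrips_prep"

def check_table(project_id: str) -> int:
    """Fetch table metadata; raises google.api_core.exceptions.NotFound
    if the dataset-management notebook has not been run yet."""
    from google.cloud import bigquery  # deferred: needs google-cloud-bigquery
    client = bigquery.Client(project=project_id)
    table = client.get_table(table_ref(project_id))
    return table.num_rows
```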

Project ID

To run in GCP, the project ID needs to be provided via the beam_pipeline_args argument.

#888 has been proposed to make this work. With that change, you would be able to do:

context.run(..., beam_pipeline_args=['--project', 'my-project'])
query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query),
    beam_pipeline_args=[
        '--project', PROJECT_ID,
    ]
)

However, it still fails with another error.

ValueError: ReadFromBigQuery requires a GCS location to be provided. Neither gcs_location in the constructor nor the fallback option --temp_location is set. [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

GCS Bucket

It looks like, inside GCP, the interactive context runs the BigQueryExampleGen via Dataflow, hence a GCS bucket URL also needs to be provided via the beam_pipeline_args argument. As the error message suggests:

When running your Dataflow pipeline, pass the argument --temp_location gs://bucket/subfolder/

query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query),
    beam_pipeline_args=[
        '--project', PROJECT_ID,
        '--temp_location', BUCKET
    ]
)
statistics_gen = context.run(
    StatisticsGen(examples=example_gen.component.outputs['examples'])
)
context.show(statistics_gen.component.outputs['statistics'])

schema_gen = SchemaGen(
    statistics=statistics_gen.component.outputs['statistics'],
    infer_feature_shape=True
)
context.run(schema_gen)
context.show(schema_gen.outputs['schema'])


Documentation

This notebook-based tutorial will use Google Cloud BigQuery as a data source to train an ML model. The ML pipeline will be constructed using TFX and run on Google Cloud Vertex Pipelines. In this tutorial, we will use the BigQueryExampleGen component, which reads data from BigQuery into TFX pipelines.

We also need to pass beam_pipeline_args for the BigQueryExampleGen. It includes configs like the name of the GCP project and the temporary storage for the BigQuery execution.
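Putting the two findings together, the minimal Beam arguments are the billing project and a GCS temp location. A small sketch (make_beam_args is a hypothetical convenience; passing the literal list works just as well):

```python
def make_beam_args(project_id: str, bucket: str) -> list:
    """Beam pipeline args that BigQueryExampleGen needs on GCP:
    the billing project and a GCS scratch area for query results."""
    return [
        "--project", project_id,
        "--temp_location", f"gs://{bucket}/tmp",
    ]

# e.g. context.run(BigQueryExampleGen(query=query),
#                  beam_pipeline_args=make_beam_args(PROJECT_ID, "my-bucket"))
```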
