
Dataprep is leaving Datasets/Tables behind in BigQuery

I am using Google Cloud Dataprep for processing data stored in BigQuery. I am having an issue where Dataprep/Dataflow creates a new dataset with a name starting with "temp_dataset_beam_job_".

It seems to create the temporary dataset for both failed and successful Dataflow jobs that Dataprep launches. This is an issue because BigQuery becomes messy very quickly with all of these leftover datasets.

This has not been an issue in the past.

A similar issue has been described in this GitHub thread: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609

Is there any way to avoid creating temporary datasets, or to have them created in a Cloud Storage folder instead?

I wrote a cleanup script that I run in Cloud Run (see this article), triggered by Cloud Scheduler.

Below is the script:

#!/bin/bash

PROJECT={PROJECT_NAME}

# get list of datasets with temp_dataset_beam
# optional: write list of files to cloud storage
obj="gs://{BUCKET_NAME}/maintenance-report-$(date +%s).txt"
bq ls --max_results=100 | grep "temp_dataset_beam" | gsutil -q cp -J - "${obj}"

datasets=$(bq ls --max_results=100 | grep "temp_dataset_beam")

for dataset in $datasets
do
  echo "$PROJECT:$dataset"
  # WARNING: Uncomment the line below to remove datasets.
  # --recursive is needed to delete a dataset that still contains tables.
  # bq rm --dataset=true --force=true --recursive=true "$PROJECT:$dataset"
done
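The grep-based filtering in the script can be sanity-checked offline against text resembling `bq ls` output, without touching a real project. The dataset and project names below are made up for illustration:

```shell
#!/bin/bash
# Sample text shaped like `bq ls` output (dataset IDs in the first column)
sample_output="    datasetId
 -------------------------------
  analytics
  temp_dataset_beam_job_1572018_abc
  temp_dataset_beam_job_1572019_def
  reporting"

# Same filter as in the cleanup script; awk trims the leading whitespace
datasets=$(echo "$sample_output" | grep "temp_dataset_beam" | awk '{print $1}')

PROJECT=my-project  # placeholder project name
for dataset in $datasets
do
  echo "$PROJECT:$dataset"
done
# Prints only the two temp_dataset_beam_job_* datasets, prefixed with the project
```

Only the two temporary Beam datasets survive the filter; `analytics` and `reporting` are left alone, which is what you want before uncommenting the `bq rm` line.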

I solved this in Dataprep directly by running a SQL script post data publish, so it runs after each job. You can set this in Dataprep under the output's Manual settings.

FOR drop_statement IN
  (SELECT CONCAT("drop table `<project_id>.", table_schema, ".", table_name, "`;") AS value
      FROM <dataset>.INFORMATION_SCHEMA.TABLES -- or region.INFORMATION_SCHEMA.TABLES
      WHERE table_name LIKE "Dataprep_%"
      ORDER BY table_name DESC)
DO
  EXECUTE IMMEDIATE(drop_statement.value); -- the table is dropped here
END FOR;
