
Dataprep is leaving Datasets/Tables behind in BigQuery

I am using Google Cloud Dataprep to process data stored in BigQuery. My issue is that Dataprep/Dataflow creates a new dataset whose name starts with "temp_dataset_beam_job_".

It seems to create the temporary dataset for both failed and successful Dataflow jobs that Dataprep launches. This is a problem because BigQuery becomes messy very quickly with all these flows.

This has not been an issue in the past.

A similar issue has been described in this GitHub thread: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609

Is there any way to avoid creating these temporary datasets, or to have them created in a Cloud Storage folder instead?

I wrote a cleanup script that I am running in Cloud Run (see this article) using Cloud Scheduler.

Below is the script:

#!/bin/bash

PROJECT={PROJECT_NAME}

# List datasets whose name contains temp_dataset_beam (first 100 results only)
datasets=$(bq ls --max_results=100 | grep "temp_dataset_beam")

# Optional: write the list to Cloud Storage as a maintenance report
obj="gs://{BUCKET_NAME}/maintenance-report-$(date +%s).txt"
echo "${datasets}" | gsutil -q cp -J - "${obj}"

# $datasets is left unquoted on purpose so the shell splits it into one ID per iteration
for dataset in $datasets
do
  echo "${PROJECT}:${dataset}"
  # WARNING: Uncomment the line below to remove datasets
  # (add --recursive=true if the datasets still contain tables)
  # bq rm --dataset=true --force=true "${PROJECT}:${dataset}"
done
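The script above selects datasets with grep, which matches "temp_dataset_beam" anywhere on a line. As a minimal sketch of a stricter filter, the helper below keeps only the IDs that start with the prefix. The function name filter_temp_datasets is hypothetical, and the sketch assumes the dataset ID is the first whitespace-separated field of each `bq ls` output line.

```shell
#!/bin/bash

# Hypothetical helper: read `bq ls`-style text on stdin and print only the
# dataset IDs that START with the given prefix (default: temp_dataset_beam).
# Assumption: the dataset ID is the first whitespace-separated field per line.
filter_temp_datasets() {
  local prefix="${1:-temp_dataset_beam}"
  # index() returns 1 only when the first field begins with the prefix,
  # so header lines and names that merely contain the prefix are skipped.
  awk -v prefix="$prefix" 'index($1, prefix) == 1 { print $1 }'
}

# Usage (requires the bq CLI, so it is left commented out here):
# datasets=$(bq ls --max_results=100 | filter_temp_datasets)
```

With this filter, a dataset that only contains the prefix somewhere in the middle of its name would not be picked up for deletion.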

I solved this in Dataprep directly with a SQL script that runs after the data is published, so it executes after each job. You can set this in Dataprep under the output's Manual settings.

FOR drop_statement IN
  (SELECT CONCAT("drop table `<project_id>.", table_schema, ".", table_name, "`;") AS value
      FROM <dataset>.INFORMATION_SCHEMA.TABLES -- or region.INFORMATION_SCHEMA.TABLES
      WHERE table_name LIKE "Dataprep_%"
      ORDER BY table_name DESC)
DO
  EXECUTE IMMEDIATE(drop_statement.value); -- here the table is dropped
END FOR;
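Before wiring a destructive statement like the one above into every job, it may help to preview which tables it would target. The sketch below builds a read-only query for that purpose. The function name build_preview_sql and its three arguments are hypothetical, and it uses STARTS_WITH instead of LIKE so that the underscore in "Dataprep_" is treated literally rather than as a single-character wildcard.

```shell
#!/bin/bash

# Hypothetical helper: print a read-only query listing the tables that a
# prefix-based DROP loop would target. Nothing is deleted.
# Arguments (placeholders): project ID, dataset name, table-name prefix.
build_preview_sql() {
  local project="$1" dataset="$2" prefix="$3"
  # The project ID is wrapped in backticks because it may contain hyphens.
  printf 'SELECT table_schema, table_name FROM `%s`.%s.INFORMATION_SCHEMA.TABLES WHERE STARTS_WITH(table_name, "%s") ORDER BY table_name DESC' \
    "$project" "$dataset" "$prefix"
}

# Usage (requires the bq CLI, so it is left commented out here):
# bq query --use_legacy_sql=false "$(build_preview_sql my-project my_dataset Dataprep_)"
```

If the preview returns only Dataprep's leftover tables, the drop script can be enabled with more confidence.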
