Dataprep is leaving Datasets/Tables behind in BigQuery

Question

I am using Google Cloud Dataprep to process data stored in BigQuery. The problem is that Dataprep/Dataflow creates a new dataset whose name starts with "temp_dataset_beam_job_".

It seems to create the temporary dataset for both failed and successful Dataflow jobs that Dataprep creates. This is an issue because BigQuery becomes messy very quickly with all these flows.

This has not been an issue in the past.

A similar issue has been described in this GitHub thread: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609

Is there any way to avoid creating these temporary datasets, or to create them in a Cloud Storage folder instead?

Answer 1

I wrote a cleanup script that I am running in Cloud Run (see this article) using Cloud Scheduler.

Below is the script:

#!/bin/bash

PROJECT={PROJECT_NAME}

# Get the list of datasets whose name contains temp_dataset_beam.
# Note: --max_results=100 caps the listing at 100 datasets per run.
datasets=$(bq ls --max_results=100 | grep "temp_dataset_beam")

# Optional: write the list of datasets to Cloud Storage
obj="gs://{BUCKET_NAME}/maintenance-report-$(date +%s).txt"
echo "${datasets}" | gsutil -q cp -J - "${obj}"

for dataset in $datasets
do
  echo "${PROJECT}:${dataset}"
  # WARNING: Uncomment the line below to remove datasets.
  # --recursive=true is needed if a dataset still contains tables.
  # bq rm --dataset=true --recursive=true --force=true "${PROJECT}:${dataset}"
done
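
One caveat with the grep above is that it matches the string anywhere in a line. As a minimal sketch, the filtering step could be pulled into a small helper that only keeps names starting with the prefix from the question. The function name filter_temp_datasets is hypothetical, and it assumes the dataset ID is the first whitespace-separated field of each line of bq ls output.

```shell
#!/bin/bash

# Hypothetical helper: reads `bq ls`-style output on stdin and prints only
# the dataset IDs that start with the temporary prefix, one per line.
filter_temp_datasets() {
  # Assumption: the dataset ID is the first whitespace-separated field.
  # Header and separator lines never start with the prefix, so they are skipped.
  awk '$1 ~ /^temp_dataset_beam_job_/ { print $1 }'
}

# Example wiring (not executed here; needs the bq CLI and credentials):
#   datasets=$(bq ls --max_results=100 | filter_temp_datasets)
```

Anchoring the match at the start of the name means a dataset that merely contains "temp_dataset_beam" somewhere in its name is not picked up for deletion.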

Answer 2

I solved this directly in Dataprep by adding a SQL script that runs after the data is published, so it executes after each job. You can set this in Dataprep in the output Manual settings.

FOR drop_statement IN
  (SELECT CONCAT("drop table `<project_id>.", table_schema, ".", table_name, "`;") AS value
      FROM <dataset>.INFORMATION_SCHEMA.TABLES -- or region.INFORMATION_SCHEMA.TABLES
      WHERE table_name LIKE "Dataprep_%"
      ORDER BY table_name DESC)
DO
  EXECUTE IMMEDIATE(drop_statement.value); -- Here the table is dropped
END FOR;
