How to Transfer a BigQuery view to a Google Cloud Storage bucket as a csv file
I need to export the content of a BigQuery view to a csv file in GCP, with an Airflow DAG. To export the content of a BQ TABLE, I can use BigQueryToCloudStorageOperator. But in my case I need to use an existing view, and BigQueryToCloudStorageOperator fails with this error, which I see while checking the logs for the failed DAG:

BigQuery job failed: my_view is not allowed for this operation because it is currently a VIEW

So, what options do I have here? I can't use a regular table, so maybe there is another operator that would work with view data stored in BQ, instead of a table? Or maybe the same operator would work with some additional options (although I don't see anything useful in the Apache documentation for BigQueryToCloudStorageOperator)?
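For reference, this is a minimal sketch of the kind of task that triggers the error; the project, dataset, view and bucket names are just placeholders, and the import path may differ depending on the Airflow version:

```python
# Airflow 1.10-style import; newer versions expose the same operator
# from the google provider package.
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

# Works for a regular table, but fails with
# "... is not allowed for this operation because it is currently a VIEW"
# when source_project_dataset_table points at a view.
export_view = BigQueryToCloudStorageOperator(
    task_id="export_view_to_gcs",
    source_project_dataset_table="my_project.my_dataset.my_view",  # a VIEW, not a table
    destination_cloud_storage_uris=["gs://my_bucket/my_view.csv"],
    export_format="CSV",
)
```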
I think the Bigquery client doesn't give the possibility to export a view to a GCS file.
It's not perfect but I propose you 2 solutions.

First solution (more native with existing operators):

- Create a staging table in order to export it to GCS
- At the beginning of your DAG, add a task that truncates this staging table
- Add a task with a select on your view and an insert in your staging table (insert/select)
- Use the bigquery_to_gcs operator from your staging table (see the sketch after this list)
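One possible way to wire this up, shown here as a sketch with the current Google provider operators (BigQueryInsertJobOperator for the SQL steps is my choice, not something the answer prescribes); all project, dataset, table and bucket names are placeholders:

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

# 1. Empty the staging table so it only holds the latest snapshot of the view.
truncate_staging = BigQueryInsertJobOperator(
    task_id="truncate_staging_table",
    configuration={
        "query": {
            "query": "TRUNCATE TABLE `my_project.my_dataset.my_staging_table`",
            "useLegacySql": False,
        }
    },
)

# 2. Materialize the view into the staging table (insert/select).
load_staging = BigQueryInsertJobOperator(
    task_id="load_staging_table",
    configuration={
        "query": {
            "query": (
                "INSERT INTO `my_project.my_dataset.my_staging_table` "
                "SELECT * FROM `my_project.my_dataset.my_view`"
            ),
            "useLegacySql": False,
        }
    },
)

# 3. Export the staging table (a real table, so the export is allowed) to GCS as CSV.
export_to_gcs = BigQueryToGCSOperator(
    task_id="export_staging_to_gcs",
    source_project_dataset_table="my_project.my_dataset.my_staging_table",
    destination_cloud_storage_uris=["gs://my_bucket/my_view.csv"],
    export_format="CSV",
)

truncate_staging >> load_staging >> export_to_gcs
```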
Second solution (less native with Python clients and PythonOperator):

- Use a PythonOperator
- In this operator, use the Bigquery Python client to load the data from your view as a Dict, and the storage Python client to generate a file to GCS from this Dict (a sketch follows below)
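A rough sketch of that approach, assuming the google-cloud-bigquery and google-cloud-storage packages are available on the workers; names are placeholders and the whole result is held in memory, so this only suits small to medium views:

```python
import csv
import io

from airflow.operators.python import PythonOperator  # path varies by Airflow version
from google.cloud import bigquery, storage


def export_view_to_gcs():
    """Query the view, build a CSV in memory and upload it to GCS."""
    bq_client = bigquery.Client()
    rows = bq_client.query("SELECT * FROM `my_project.my_dataset.my_view`").result()
    records = [dict(row) for row in rows]  # one dict per row

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[field.name for field in rows.schema])
    writer.writeheader()
    writer.writerows(records)

    storage.Client().bucket("my_bucket").blob("my_view.csv").upload_from_string(
        buffer.getvalue(), content_type="text/csv"
    )


export_task = PythonOperator(
    task_id="export_view_to_gcs",
    python_callable=export_view_to_gcs,
)
```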
I have a preference for the first solution, even if it forces me to create a staging table.
I ended up with a kind of combined solution, part of it is what Mazlum Tosun suggested in his answer: in my DAG I added an extra first step, a DataLakeKubernetesPodOperator, which runs a Python file. In that Python file there are calls to SQL files, which contain simple queries (put in the await asyncio.wait(...) block and executed with bq_execute()): truncate an existing table (to prepare it for new data), and then copy (insert) data from the view to the truncated table (as Mazlum Tosun suggested).
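Leaving aside the asyncio and bq_execute() wrapping, which are specific to my setup, the two queries boil down to something like the following sketch, shown here synchronously with the plain BigQuery Python client (table and view names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Step 1: empty the regular (staging) table before loading fresh data.
client.query("TRUNCATE TABLE `my_project.my_dataset.my_table`").result()

# Step 2: copy the current content of the view into the truncated table.
client.query(
    "INSERT INTO `my_project.my_dataset.my_table` "
    "SELECT * FROM `my_project.my_dataset.my_view`"
).result()
```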
After that step, the rest is the same as before: I use BigQueryToCloudStorageOperator to copy data from the regular table (which now contains data from the view) to the Google Cloud Storage bucket, and now it works fine.