How to Transfer a BigQuery view to a Google Cloud Storage bucket as a csv file
I need to export the content of a BigQuery view to a csv file in GCP, with an Airflow DAG. To export the content of a BQ TABLE, I can use BigQueryToCloudStorageOperator. But in my case I need to use an existing view, and BigQueryToCloudStorageOperator fails with this error, which I see while checking the logs for the failed DAG:

BigQuery job failed: my_view is not allowed for this operation because it is currently a VIEW

So, what options do I have here? I can't use a regular table, so maybe there is another operator that would work with view data stored in BQ, instead of a table? Or maybe the same operator would work with some additional options (although I don't see anything useful in the Apache documentation for BigQueryToCloudStorageOperator)?
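For reference, this is a minimal sketch of the kind of task that triggers the error; the project, dataset, view and bucket names are just placeholders, and the import path may differ depending on the Airflow version:

```python
# Airflow 1.10-style import; newer versions expose the same operator
# from the google provider package.
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

# Works for a regular table, but fails with
# "... is not allowed for this operation because it is currently a VIEW"
# when source_project_dataset_table points at a view.
export_view = BigQueryToCloudStorageOperator(
    task_id="export_view_to_gcs",
    source_project_dataset_table="my_project.my_dataset.my_view",  # a VIEW, not a table
    destination_cloud_storage_uris=["gs://my_bucket/my_view.csv"],
    export_format="CSV",
)
```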
I think the Bigquery client doesn't give the possibility to export a view to a GCS file.
It's not perfect but I propose you 2 solutions.

First solution (more native with existing operators):

- Create a staging table in order to export it to GCS
- At the beginning of your DAG, add a task that truncates this staging table
- Add a task with a select on your view and an insert in your staging table (insert/select)
- Use the bigquery_to_gcs operator from your staging table (see the sketch after this list)
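One possible way to wire this up, shown here as a sketch with the current Google provider operators (BigQueryInsertJobOperator for the SQL steps is my choice, not something the answer prescribes); all project, dataset, table and bucket names are placeholders:

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

# 1. Empty the staging table so it only holds the latest snapshot of the view.
truncate_staging = BigQueryInsertJobOperator(
    task_id="truncate_staging_table",
    configuration={
        "query": {
            "query": "TRUNCATE TABLE `my_project.my_dataset.my_staging_table`",
            "useLegacySql": False,
        }
    },
)

# 2. Materialize the view into the staging table (insert/select).
load_staging = BigQueryInsertJobOperator(
    task_id="load_staging_table",
    configuration={
        "query": {
            "query": (
                "INSERT INTO `my_project.my_dataset.my_staging_table` "
                "SELECT * FROM `my_project.my_dataset.my_view`"
            ),
            "useLegacySql": False,
        }
    },
)

# 3. Export the staging table (a real table, so the export is allowed) to GCS as CSV.
export_to_gcs = BigQueryToGCSOperator(
    task_id="export_staging_to_gcs",
    source_project_dataset_table="my_project.my_dataset.my_staging_table",
    destination_cloud_storage_uris=["gs://my_bucket/my_view.csv"],
    export_format="CSV",
)

truncate_staging >> load_staging >> export_to_gcs
```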
Second solution (less native with Python clients and PythonOperator):

- Use a PythonOperator
- In this operator, use the Bigquery Python client to load the data from your view as a Dict, and the storage Python client to generate a file to GCS from this Dict (a sketch follows below)
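A rough sketch of that approach, assuming the google-cloud-bigquery and google-cloud-storage packages are available on the workers; names are placeholders and the whole result is held in memory, so this only suits small to medium views:

```python
import csv
import io

from airflow.operators.python import PythonOperator  # path varies by Airflow version
from google.cloud import bigquery, storage


def export_view_to_gcs():
    """Query the view, build a CSV in memory and upload it to GCS."""
    bq_client = bigquery.Client()
    rows = bq_client.query("SELECT * FROM `my_project.my_dataset.my_view`").result()
    records = [dict(row) for row in rows]  # one dict per row

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[field.name for field in rows.schema])
    writer.writeheader()
    writer.writerows(records)

    storage.Client().bucket("my_bucket").blob("my_view.csv").upload_from_string(
        buffer.getvalue(), content_type="text/csv"
    )


export_task = PythonOperator(
    task_id="export_view_to_gcs",
    python_callable=export_view_to_gcs,
)
```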
I have a preference for the first solution, even if it forces me to create a staging table.
I ended up with a kind of combined solution, part of it is what Mazlum Tosun suggested in his answer: in my DAG I added an extra first step, a DataLakeKubernetesPodOperator, which runs a Python file. In that Python file there are calls to SQL files, which contain simple queries (put in the await asyncio.wait(...) block and executed with bq_execute()): truncate an existing table (to prepare it for new data), and then copy (insert) data from the view to the truncated table (as Mazlum Tosun suggested).
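Leaving aside the asyncio and bq_execute() wrapping, which are specific to my setup, the two queries boil down to something like the following sketch, shown here synchronously with the plain BigQuery Python client (table and view names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Step 1: empty the regular (staging) table before loading fresh data.
client.query("TRUNCATE TABLE `my_project.my_dataset.my_table`").result()

# Step 2: copy the current content of the view into the truncated table.
client.query(
    "INSERT INTO `my_project.my_dataset.my_table` "
    "SELECT * FROM `my_project.my_dataset.my_view`"
).result()
```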
After that step, the rest is the same as before: I use BigQueryToCloudStorageOperator to copy data from the regular table (which now contains data from the view) to the Google Cloud Storage bucket, and now it works fine.