
How to Transfer a BigQuery view to a Google Cloud Storage bucket as a csv file

I need to export the content of a BigQuery view to a CSV file in GCS, using an Airflow DAG. To export the content of a BQ table, I can use BigQueryToCloudStorageOperator. But in my case I need to use an existing view, and BigQueryToCloudStorageOperator fails with this error, which I see when checking the logs for the failed DAG:

BigQuery job failed: my_view is not allowed for this operation because it is currently a VIEW

So, what options do I have here? I can't use a regular table, so maybe there is another operator that would work with view data stored in BQ instead of a table? Or maybe the same operator would work with some additional options (although I don't see anything useful in the Apache documentation for BigQueryToCloudStorageOperator)?

I think the BigQuery client doesn't offer a way to export a view directly to a GCS file.

It's not perfect, but I propose two solutions.

First solution (more native, with existing operators):

  • Create a staging table to export it to GCS
  • At the beginning of your DAG, create a task that truncates this staging table
  • Add a task with a select on your view and an insert into your staging table (insert/select)
  • Use the bigquery_to_gcs operator on your staging table

Second solution (less native, with Python clients and a PythonOperator):

  • Use a PythonOperator
  • In this operator, use the BigQuery Python client to load the data from your view and the Storage Python client to generate a file in GCS from that data
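A minimal sketch of that callable, assuming the google-cloud-bigquery and google-cloud-storage packages are available and using placeholder resource names. The CSV-building part is split into a pure helper; the GCP imports sit inside the function, the usual Airflow practice so the DAG file still parses where those extras aren't installed.

```python
# Sketch of the PythonOperator approach; all resource names are placeholders.
import csv
import io


def rows_to_csv(header, rows):
    """Serialize a header and an iterable of row tuples to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def export_view_to_gcs(**context):
    # Imported inside the task callable so the DAG file parses without
    # these packages on the scheduler path.
    from google.cloud import bigquery, storage

    bq = bigquery.Client()
    result = bq.query("SELECT * FROM `my-project.my_dataset.my_view`").result()

    # Build the CSV in memory; for large views, stream in chunks instead.
    csv_text = rows_to_csv(
        [field.name for field in result.schema],
        (row.values() for row in result),
    )

    storage.Client().bucket("my-bucket").blob("my_view.csv").upload_from_string(
        csv_text, content_type="text/csv"
    )
```

The function would then be wired into the DAG with `PythonOperator(task_id="export_view", python_callable=export_view_to_gcs)`.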

I have a preference for the first solution, even if it forces me to create a staging table.

I ended up with a kind of combined solution, part of which is what Mazlum Tosun suggested in his answer: in my DAG I added an extra first step, a DataLakeKubernetesPodOperator, which runs a Python file. That Python file calls SQL files containing simple queries (put in the await asyncio.wait(...) block and executed with bq_execute()): first truncate an existing table (to prepare it for new data), and then copy (insert) data from the view into the truncated table (as Mazlum Tosun suggested).
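The two SQL files could look something like this; the file names and all table/view identifiers are hypothetical placeholders, not from the original setup.

```sql
-- truncate_staging.sql: clear the staging table before loading fresh data
TRUNCATE TABLE `my-project.my_dataset.my_view_staging`;

-- insert_from_view.sql: materialize the view into the staging table
INSERT INTO `my-project.my_dataset.my_view_staging`
SELECT * FROM `my-project.my_dataset.my_view`;
```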

After that step, the rest is the same as before: I use BigQueryToCloudStorageOperator to copy data from the regular table (which now contains the data from the view) to the Google Cloud Storage bucket, and now it works fine.
