How to schedule an export from a BigQuery table to Cloud Storage?

I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfers into BigQuery or Cloud Storage, but I haven't found anything yet about scheduling an export from a BigQuery table to Cloud Storage.

Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?

There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.

The Cloud Function would contain the necessary code to export to Cloud Storage from the BigQuery table. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.

Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which would in turn be triggered and run the export programmatically.

As an example, and more specifically, you can follow these steps:

  1. Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:

     # Imports the BigQuery client library
     from google.cloud import bigquery

     def hello_world(request):
         # Replace these values according to your project
         project_name = "YOUR_PROJECT_ID"
         bucket_name = "YOUR_BUCKET"
         dataset_name = "YOUR_DATASET"
         table_name = "YOUR_TABLE"

         destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

         bq_client = bigquery.Client(project=project_name)

         dataset = bq_client.dataset(dataset_name, project=project_name)
         table_to_export = dataset.table(table_name)

         job_config = bigquery.job.ExtractJobConfig()
         job_config.compression = bigquery.Compression.GZIP

         extract_job = bq_client.extract_table(
             table_to_export,
             destination_uri,
             # Location must match that of the source table.
             location="US",
             job_config=job_config,
         )

         return "Job with ID {} started exporting data from {}.{} to {}".format(
             extract_job.job_id, dataset_name, table_name, destination_uri
         )

    Specify the client library dependency in the requirements.txt file by adding this line:

     google-cloud-bigquery
  2. Create a Cloud Scheduler job. Set the Frequency with which you wish the job to be executed. For instance, setting it to 0 1 * * 0 would run the job once a week, at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.

    Choose HTTP as the Target, set the URL to the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and choose GET as the HTTP method.

    Once created, you can test how the export behaves by pressing the RUN NOW button. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.

    If you wish to execute exports on different tables, datasets and buckets for each execution, while essentially employing the same Cloud Function, you can use the HTTP POST method instead and configure a Body containing those parameters as data, which would be passed on to the Cloud Function - although that would imply making some small changes in its code, as sketched below.
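
    As an illustration of that change, here is a minimal sketch of a Cloud Function reading such parameters from the POST body. The key names (dataset, table, bucket) are an assumption made for this example, not something defined by Cloud Scheduler itself:

     from google.cloud import bigquery

     def hello_world(request):
         # Read parameters from the JSON body sent by the Cloud Scheduler job.
         # The key names used here are an assumption for this sketch.
         params = request.get_json(silent=True) or {}
         dataset_name = params.get("dataset", "YOUR_DATASET")
         table_name = params.get("table", "YOUR_TABLE")
         bucket_name = params.get("bucket", "YOUR_BUCKET")

         destination_uri = "gs://{}/bq_export.csv.gz".format(bucket_name)

         bq_client = bigquery.Client()
         table_to_export = bq_client.dataset(dataset_name).table(table_name)

         job_config = bigquery.job.ExtractJobConfig()
         job_config.compression = bigquery.Compression.GZIP

         extract_job = bq_client.extract_table(
             table_to_export,
             destination_uri,
             location="US",  # Must match the location of the source table.
             job_config=job_config,
         )
         return "Job with ID {} started exporting {}.{} to {}".format(
             extract_job.job_id, dataset_name, table_name, destination_uri
         )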

Lastly, once the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.

You have an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.

But, when you schedule a query, you can also define a PubSub topic where the BigQuery scheduler will post a message when the job is over. Thereby, the Cloud Scheduler setup described by Maxim is optional and you can simply plug the function into the PubSub notification.

Before performing the extraction, don't forget to check the error status of the PubSub notification. You also get a lot of information about the scheduled query, which is useful if you want to perform more checks or if you want to generalize the function.
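
As a rough illustration, a Pub/Sub-triggered Cloud Function wired to that topic could look like the sketch below. It assumes the notification data is the JSON of the transfer run and that it exposes state and errorStatus fields; verify the exact payload your scheduled query publishes before relying on these names:

import base64
import json

from google.cloud import bigquery


def on_scheduled_query_done(event, context):
    # Background Cloud Function triggered by the scheduled query's Pub/Sub topic.
    # The payload structure below is an assumption for this sketch: check the
    # actual message your scheduled query publishes.
    run = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    if run.get("state") != "SUCCEEDED" or run.get("errorStatus"):
        print("Scheduled query did not succeed, skipping the export: {}".format(run))
        return

    bq_client = bigquery.Client()
    table_to_export = bq_client.dataset("YOUR_DATASET").table("YOUR_TABLE")

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        "gs://YOUR_BUCKET/bq_export.csv.gz",
        location="US",  # Must match the location of the source table.
        job_config=job_config,
    )
    print("Started export job {}".format(extract_job.job_id))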

One more point, about the SFTP transfer: I open sourced a project for querying BigQuery, building a CSV file and transferring this file to an FTP server (sFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use it. Let me know.

Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.

Example:

EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/folder/*.csv',
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ';')
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);

This could as well be trivially set up via a Scheduled Query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permissions to read the source datasets and tables and to write to the destination bucket.
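
If you would rather trigger the same statement programmatically, for instance from a Cloud Function as in the approach above, a minimal sketch with the BigQuery Python client could look like this (bucket, dataset, table and field names are placeholders):

from google.cloud import bigquery


def export_with_sql():
    # Run the EXPORT DATA statement as a regular query job.
    client = bigquery.Client()
    sql = """
        EXPORT DATA
          OPTIONS (
            uri = 'gs://YOUR_BUCKET/folder/*.csv',
            format = 'CSV',
            overwrite = true,
            header = true,
            field_delimiter = ';')
        AS (
          SELECT field1, field2
          FROM YOUR_DATASET.YOUR_TABLE
          ORDER BY field1
        )
    """
    job = client.query(sql)
    job.result()  # Wait for the export to complete.
    print("Export job {} finished".format(job.job_id))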

Hopefully this is useful for other peeps visiting this question, if not for OP :)
