How to schedule an export from a BigQuery table to Cloud Storage?
I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfers into BigQuery or Cloud Storage, but I haven't found anything yet about scheduling an export from a BigQuery table to Cloud Storage.

Is it possible to schedule an export of a BigQuery table to Cloud Storage, so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler. The Cloud Function would contain the necessary code to export from the BigQuery table to Cloud Storage. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.

Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which in turn would be triggered and run the export programmatically.
As an example, and more specifically, you can follow these steps:

Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with `from google.cloud import bigquery`. Then, you can use the following code in `main.py` to create an export job from BigQuery to Cloud Storage:
```python
# Imports the BigQuery client library
from google.cloud import bigquery

def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )
    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri
    )
```
Specify the client library dependency in the `requirements.txt` file by adding this line:
google-cloud-bigquery
Create a Cloud Scheduler job. Set the Frequency you wish the job to be executed with. For instance, setting it to `0 1 * * 0` would run the job once a week, at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.
Choose HTTP as the Target, set the URL to the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and choose `GET` as the HTTP method.
Once created, you can test how the export behaves by pressing the RUN NOW button. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM `roles/storage.objectCreator` role, or otherwise the operation might fail with a permission error. The default App Engine service account has the form `YOUR_PROJECT_ID@appspot.gserviceaccount.com`.
If you wish to execute exports on different tables, datasets, and buckets for each execution, while essentially employing the same Cloud Function, you can use the HTTP `POST` method instead and configure a Body containing those parameters as data, which would be passed on to the Cloud Function. That would imply making some small changes to its code.
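As a minimal sketch of that change (the parameter names and defaults here are assumptions, not a fixed API), the function could read the export parameters from the POST body, falling back to defaults when a field is missing:

```python
def parse_export_params(request_json):
    """Extract export parameters from a POST body, falling back to defaults.

    `request_json` is the dict returned by Flask's request.get_json()
    inside the Cloud Function. The key names are illustrative only.
    """
    defaults = {
        "dataset": "YOUR_DATASET",
        "table": "YOUR_TABLE",
        "bucket": "YOUR_BUCKET",
    }
    params = dict(defaults)
    if request_json:
        for key in defaults:
            if key in request_json:
                params[key] = request_json[key]
    # Derive the destination URI from the chosen bucket and table.
    params["destination_uri"] = "gs://{}/{}.csv.gz".format(
        params["bucket"], params["table"]
    )
    return params
```

The returned values would then replace the hard-coded `dataset_name`, `table_name`, and `destination_uri` in the function above.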
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the `bq` CLI to view the status of the export job with `bq show -j <job_id>`.
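If you would rather have the function wait and report the final status itself instead of checking with `bq` afterwards, a small polling helper could look like the sketch below. It only assumes the job object exposes `reload()`, `state`, and `error_result`, which BigQuery job objects do:

```python
import time

def wait_for_job(job, timeout_s=300, poll_s=10):
    """Poll a BigQuery job until it reaches the DONE state.

    `job` is any object exposing reload(), state, and error_result,
    such as the extract job returned by extract_table().
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        job.reload()  # refresh the job state from the API
        if job.state == "DONE":
            if job.error_result:
                raise RuntimeError("Export failed: {}".format(job.error_result))
            return job
        time.sleep(poll_s)
    raise TimeoutError("Export did not finish within {} seconds".format(timeout_s))
```

Note that Cloud Functions are billed by execution time, so for long-running exports it is usually cheaper to return the job ID immediately, as the code above does, and check the status separately.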
You have an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.

However, when you schedule a query, you can also define a Pub/Sub topic where the BigQuery scheduler will post a message when the job is over. Thereby, the Cloud Scheduler setup described by Maxim becomes optional: you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status of the Pub/Sub notification. The notification also carries a lot of information about the scheduled query, which is useful if you want to perform more checks or generalize the function.
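As a minimal sketch of that wiring, a Pub/Sub-triggered function could decode the scheduler's message and check its error status before starting the extract. The payload field names follow the TransferRun notification format, but treat them as assumptions and verify against a real message:

```python
import base64
import json

def handle_transfer_notification(event):
    """Decode the Pub/Sub message posted by the BigQuery scheduler and
    fail fast if the scheduled query reported an error.

    `event` is the dict a Pub/Sub-triggered Cloud Function receives,
    with the message payload base64-encoded under the "data" key.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    error = payload.get("errorStatus") or {}
    if error.get("code"):  # a non-zero code means the scheduled query failed
        raise RuntimeError("Scheduled query failed: {}".format(error))
    # Safe to start the extract job here, e.g. by reusing the code above.
    return payload
```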
Now, another point about the SFTP transfer. I open-sourced a project for querying BigQuery, building a CSV file, and transferring this file to an FTP server (SFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use it. Let me know.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.

Example:
```sql
EXPORT DATA
OPTIONS (
  uri = 'gs://bucket/folder/*.csv',
  format = 'CSV',
  overwrite = true,
  header = true,
  field_delimiter = ';')
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);
```
This could as well be trivially set up via a Scheduled Query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permissions to read the source datasets and tables and to write to the destination bucket.
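If you want to drive the same `EXPORT DATA` statement from code (for instance, from the Cloud Function shown earlier) rather than typing it by hand, a small helper can assemble it from parameters. This is just a sketch: the values are interpolated directly into the SQL with no quoting or escaping, so only pass trusted, validated identifiers:

```python
def build_export_sql(bucket, folder, dataset, table, delimiter=";"):
    """Build an EXPORT DATA statement like the example above.

    The resulting string can be submitted with the BigQuery client
    library's query() method. No escaping is performed, so only use
    trusted identifiers.
    """
    return (
        "EXPORT DATA\n"
        "OPTIONS (\n"
        "  uri = 'gs://{bucket}/{folder}/*.csv',\n"
        "  format = 'CSV',\n"
        "  overwrite = true,\n"
        "  header = true,\n"
        "  field_delimiter = '{delimiter}')\n"
        "AS (\n"
        "  SELECT *\n"
        "  FROM {dataset}.{table}\n"
        ");"
    ).format(bucket=bucket, folder=folder, delimiter=delimiter,
             dataset=dataset, table=table)
```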
Hopefully this is useful for other peeps visiting this question, if not for the OP :)