
Is there a way to get a list of the files that were generated, from a large table, when exporting from BigQuery to GCS using a wildcard option?

I used a wildcard (*) export to split a large BigQuery table into separate files in GCS, following the code sample provided in GCP's docs:

from google.cloud import bigquery
client = bigquery.Client()
bucket_name = 'bucket'
project = "project"
dataset_id = "dataset"
table_id = "table"


destination_uri = "gs://{}/{}".format(bucket_name, "table*.parquet")
dataset_ref = bigquery.DatasetReference(project, dataset_id)
table_ref = dataset_ref.table(table_id)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location="US",
    # Without this, BigQuery exports CSV by default, regardless of
    # the .parquet extension in the destination URI.
    job_config=bigquery.job.ExtractJobConfig(destination_format="PARQUET"),
)  # API request
extract_job.result()  # Waits for job to complete.

print(
    "Exported {}:{}.{} to {}".format(project, dataset_id, table_id, destination_uri)
)

This generated 19 different files in my storage bucket, like mytable000000000000.parquet, mytable000000000001.parquet, and so on (up to 000000000019).

It would be nice to have an automatic way to get a list of these file names, so that I can either compose them together or loop over them to do something else. Is there an easy way to edit the code above to do this?

You don't get an explicit list when using a wildcard, but take a look at the destinationUriFileCounts field in the extract job statistics. It tells you how many files were produced. In Python, this is exposed as the ExtractJob.destination_uri_file_counts property.
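Since BigQuery names wildcard shards by replacing the * with a zero-padded 12-digit index, the shard count alone is enough to reconstruct the file names. A minimal sketch, assuming the "table*.parquet" pattern from the sample above (the shard_names helper is mine, not part of the client library):

```python
def shard_names(prefix: str, suffix: str, count: int) -> list:
    """Build the object names BigQuery assigns to wildcard export
    shards: the '*' is replaced by a zero-padded 12-digit index."""
    return ["{}{:012d}{}".format(prefix, i, suffix) for i in range(count)]


# After extract_job.result() has returned, the job statistics carry
# one count per destination URI (a single-element list here):
# num_files = extract_job.destination_uri_file_counts[0]
# files = shard_names("table", ".parquet", num_files)
```

For example, shard_names("table", ".parquet", 2) yields ["table000000000000.parquet", "table000000000001.parquet"].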

If you want stronger validation, you could also leverage the Cloud Storage libraries and list objects matching the same pattern(s) you supplied as part of the extract configuration.
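A sketch of that validation step with the google-cloud-storage library, assuming the bucket and "table*.parquet" pattern from the sample above. list_blobs can only filter by prefix on the server side, so the suffix check is done locally via a small helper:

```python
def matches_pattern(name: str, prefix: str, suffix: str) -> bool:
    """Check an object name against the fixed parts of a wildcard
    pattern like 'table*.parquet' (prefix='table', suffix='.parquet')."""
    return name.startswith(prefix) and name.endswith(suffix)


def list_exported_files(bucket_name: str, prefix: str, suffix: str) -> list:
    """List exported shard names in a GCS bucket (requires credentials)."""
    from google.cloud import storage

    client = storage.Client()
    # Server-side filter on the prefix, local filter on the suffix.
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return [b.name for b in blobs if matches_pattern(b.name, prefix, suffix)]


# names = list_exported_files("bucket", "table", ".parquet")
# assert len(names) == extract_job.destination_uri_file_counts[0]
```

Comparing the listed names against destination_uri_file_counts catches the case where a stale object from an earlier export happens to match the same pattern.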


