Tags: Apache Beam, Dataflow, BigQuery

How can I get the list of tables from a Google BigQuery dataset using Apache Beam with DataflowRunner?

I can't figure out how to list the tables of a specified dataset. I want to migrate tables from a dataset located in the US to one in the EU using Dataflow's parallel processing programming model.

Import the library

from google.cloud import bigquery

Prepare a BigQuery client

client = bigquery.Client(project='your_project_name')

Prepare a reference to the dataset

dataset_ref = client.dataset('your_data_set_name')

Make the API request

tables = list(client.list_tables(dataset_ref))
if tables:
    for table in tables:
        print('\t{}'.format(table.table_id))

Reference: https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#datasets
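
To connect this back to the question, here is a minimal sketch of how the table list above could drive a Dataflow migration pipeline with the Apache Beam Python SDK. All project, dataset, region, and bucket names are placeholders, and it assumes the destination EU tables already exist with matching schemas:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import bigquery

project = 'your_project_name'      # placeholder
src_dataset = 'your_us_dataset'    # placeholder: source dataset in the US
dst_dataset = 'your_eu_dataset'    # placeholder: destination dataset in the EU

# List the source tables at pipeline-construction time
client = bigquery.Client(project=project)
table_ids = [t.table_id for t in client.list_tables(src_dataset)]

options = PipelineOptions(
    runner='DataflowRunner',
    project=project,
    region='europe-west1',                 # placeholder region
    temp_location='gs://your_bucket/tmp')  # placeholder bucket

with beam.Pipeline(options=options) as p:
    for table_id in table_ids:
        (p
         | 'Read {}'.format(table_id) >> beam.io.ReadFromBigQuery(
               table='{}:{}.{}'.format(project, src_dataset, table_id))
         | 'Write {}'.format(table_id) >> beam.io.WriteToBigQuery(
               '{}:{}.{}'.format(project, dst_dataset, table_id),
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

Each table becomes an independent read/write branch of the pipeline, so Dataflow can copy the tables in parallel.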

You can try the google-cloud-examples Maven repo. There's a class named BigQuerySnippets that makes an API call to get the table metadata, from which you can fetch the schema. Please note the API quota limit of a maximum of 6 concurrent requests per second.

The purpose of Dataflow is to create and run pipelines, so the ability to make arbitrary API requests such as listing tables is not included. You have to use the BigQuery Java client library to get the table list and then provide it to your Apache Beam pipeline.

import com.google.api.gax.paging.Page;
import com.google.cloud.bigquery.*;

// Build a client with application-default credentials
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

// projectId and datasetName are your own values
DatasetId datasetId = DatasetId.of(projectId, datasetName);
Page<Table> tables = bigquery.listTables(datasetId, BigQuery.TableListOption.pageSize(100));
for (Table table : tables.iterateAll()) {
  // do something with each table, e.g. table.getTableId().getTable()
}
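
As with the Python sketch above, the table list is obtained at pipeline-construction time; each table ID can then become its own read and write step in the Beam pipeline, for example via the Java SDK's BigQueryIO connector.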
