
BigQueryIO read get TableSchema

What I want to do is read an existing table and generate a new table that has the same schema as the original plus a few extra columns (computed from some columns of the original table). The original table's schema may be extended without notice to me (the fields I use in my Dataflow job won't change), so I would like to always read the schema instead of defining some custom class that contains it.

In Dataflow SDK 1.x, I can get the TableSchema via

final DataflowPipelineOptions options = ...
final String projectId = ...
final String dataset = ...
final String table = ...

final TableSchema schema = new BigQueryServicesImpl()
    .getDatasetService(options)
    .getTable(projectId, dataset, table)
    .getSchema();

For Dataflow SDK 2.x, BigQueryServicesImpl has become a package-private class.

I read the responses in Get TableSchema from BigQuery result PCollection<TableRow>, but I'd prefer not to make a separate query to BigQuery. As that answer is now almost 2 years old, are there other thoughts or ideas from the SO community?

Due to how BigQueryIO is set up now, it needs to query the table schema before the pipeline begins to run. This is a good feature idea, but it's not feasible within a single pipeline: in the example you linked, the table schema is queried before running the pipeline.

If new columns are added, then unfortunately the pipeline must be relaunched.
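One workaround is to fetch the schema yourself at pipeline-construction time with the plain BigQuery API client (the same com.google.api.services.bigquery.model classes BigQueryIO uses) and append your computed columns before handing the schema to BigQueryIO. Below is a minimal sketch, assuming the google-api-services-bigquery client and application-default credentials are available; the helper name fetchTableSchema and the application name are placeholders, not part of any SDK:

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.TableSchema;

public class SchemaFetcher {

  // Looks up the table via the BigQuery REST API and returns its schema.
  // This runs at pipeline-construction time, before the pipeline is launched.
  static TableSchema fetchTableSchema(String projectId, String dataset, String table)
      throws Exception {
    GoogleCredential credential = GoogleCredential.getApplicationDefault()
        .createScoped(BigqueryScopes.all());
    Bigquery bigquery = new Bigquery.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            credential)
        .setApplicationName("schema-fetcher")  // placeholder application name
        .build();
    return bigquery.tables()
        .get(projectId, dataset, table)
        .execute()
        .getSchema();
  }
}

The returned TableSchema's field list can then be copied, the extra computed columns appended, and the extended schema passed to BigQueryIO's withSchema(...). For example (the column name and type here are placeholders):

TableSchema original = SchemaFetcher.fetchTableSchema(projectId, dataset, table);
List<TableFieldSchema> fields = new ArrayList<>(original.getFields());
fields.add(new TableFieldSchema().setName("my_computed_column").setType("FLOAT"));
TableSchema extended = new TableSchema().setFields(fields);

The lookup still happens before the pipeline runs, so a schema change on the source table still requires relaunching the job.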
