I'm using a Dataflow job to read from a MS-SQL database and write the results to a Big Query table. The purpose of the Dataflow job is to be able to create tables with different schemas based on whatever query was run. I can't find a way to set up a generic Row Mapper when doing the JDBCIO read, and was hoping there was a standard way to create a row to write to Big Query based on the schema of the rows returned in the JDBCIO.read ResultSet.
I get the following error when I don't include the RowMapper in my query definition:
Exception in thread "main" java.lang.IllegalArgumentException: withRowMapper() is required
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
at org.apache.beam.sdk.io.jdbc.JdbcIO$Read.expand(JdbcIO.java:810)
at org.apache.beam.sdk.io.jdbc.JdbcIO$Read.expand(JdbcIO.java:711)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:548)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:499)
at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:56)
at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:192)
at edu.mayo.mcc.aide.sqaTransfer.SqaTransfer.buildPipeline(SqaTransfer.java:81)
at edu.mayo.mcc.aide.sqaTransfer.SqaTransfer.main(SqaTransfer.java:66)
I am trying to write based on the following setup:
PCollection<TableRow> results = pipeline
.apply("Connect", JdbcIO.<TableRow>read()
.withDataSourceConfiguration(buildDataSourceConfig(options, URL))
.withQuery(query)
.withRowMapper("WHAT NEEDS TO BE HERE TO CREATE A GENERIC ROW MAPPER"));
results.apply("Write to BQ",
BigQueryIO.writeTableRows()
.to(dataset)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
In the read part, you can use TableRow with JdbcIO as follows:
PCollection<TableRow> results = pipeline
    .apply("Connect", JdbcIO.<TableRow>read()
        .withDataSourceConfiguration(buildDataSourceConfig(options, URL))
        .withQuery(query)
        .withCoder(TableRowJsonCoder.of())
        .withRowMapper(new JdbcIO.RowMapper<TableRow>() {
            @Override
            public TableRow mapRow(ResultSet resultSet) throws Exception {
                TableRow tableRow = new TableRow();
                // Implement your mapping logic here.
                return tableRow;
            }
        }));
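Since your goal is a mapper that works for any query, one option is to build the TableRow from the ResultSet metadata. This is a minimal sketch on my part (not tested against your database); reading every column with getString() keeps the values JSON-friendly for TableRowJsonCoder:

import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.jdbc.JdbcIO;

// Generic mapper: one TableRow field per column in the result set.
JdbcIO.RowMapper<TableRow> genericRowMapper = new JdbcIO.RowMapper<TableRow>() {
    @Override
    public TableRow mapRow(ResultSet resultSet) throws Exception {
        ResultSetMetaData meta = resultSet.getMetaData();
        TableRow tableRow = new TableRow();
        for (int i = 1; i <= meta.getColumnCount(); i++) {
            // getString() works for most SQL types; refine per-type if you
            // need native INTEGER/FLOAT/TIMESTAMP values in BigQuery.
            tableRow.set(meta.getColumnLabel(i), resultSet.getString(i));
        }
        return tableRow;
    }
};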
That said, I think Dataflow isn't the easiest service for your need, because you have to pass a schema to BigQueryIO to allow Dataflow to create the table with the CREATE_IF_NEEDED option.
The schema is passed with the withSchema(schema) method, for example:
rows.apply(
"Write to BigQuery",
BigQueryIO.writeTableRows()
.to(String.format("%s:%s.%s", project, dataset, table))
.withSchema(schema)
// For CreateDisposition:
// - CREATE_IF_NEEDED (default): creates the table if it doesn't exist, a schema is
// required
// - CREATE_NEVER: raises an error if the table doesn't exist, a schema is not needed
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
// For WriteDisposition:
// - WRITE_EMPTY (default): raises an error if the table is not empty
// - WRITE_APPEND: appends new rows to existing rows
// - WRITE_TRUNCATE: deletes the existing rows before writing
.withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));
// pipeline.run().waitUntilFinish();
That means in your case, you have to build the schema from the current TableRow in the PCollection. That's not easy to do with Beam. It would have been easier if you could have used a fixed schema for your BigQuery table.
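If you do want to stay on Dataflow, one workaround is to derive the schema before the pipeline is submitted, by asking the JDBC driver for the query's metadata at pipeline-construction time. The helper below is a hypothetical sketch (inferSchema and toBigQueryType are my names, the type mapping is illustrative, and some drivers return null from getMetaData() on an unexecuted statement), not tested code:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: infer a BigQuery TableSchema from the query's JDBC metadata.
static TableSchema inferSchema(String jdbcUrl, String query) throws SQLException {
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         PreparedStatement stmt = conn.prepareStatement(query)) {
        // Many drivers can describe a prepared statement without executing it;
        // if getMetaData() returns null here, fall back to executing the query.
        ResultSetMetaData meta = stmt.getMetaData();
        List<TableFieldSchema> fields = new ArrayList<>();
        for (int i = 1; i <= meta.getColumnCount(); i++) {
            fields.add(new TableFieldSchema()
                .setName(meta.getColumnLabel(i))
                .setType(toBigQueryType(meta.getColumnType(i))));
        }
        return new TableSchema().setFields(fields);
    }
}

// Rough JDBC-to-BigQuery type mapping; extend it to match your columns.
static String toBigQueryType(int jdbcType) {
    switch (jdbcType) {
        case Types.INTEGER: case Types.BIGINT: case Types.SMALLINT: return "INTEGER";
        case Types.DOUBLE: case Types.FLOAT: case Types.REAL:
        case Types.NUMERIC: case Types.DECIMAL: return "FLOAT";
        case Types.BOOLEAN: case Types.BIT: return "BOOLEAN";
        case Types.TIMESTAMP: return "TIMESTAMP";
        default: return "STRING";
    }
}

The result can then be passed to withSchema(...) in the write step, so CREATE_IF_NEEDED has a schema to work with.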
I propose another solution and approach:

1. Export the MS-SQL data to Cloud Storage.
2. Load the Cloud Storage files into BigQuery with autodetect mode to infer the schema from the file.

You can add this logic to a shell script if you want to automate the process with the gcloud CLI.
Example of loading a GCS file into BigQuery:
bq load \
--autodetect \
--replace \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata.csv