
Dataflow JDBCIO Read with Generic Row Mapper

I'm using a Dataflow job to read from an MS-SQL database and write the results to a BigQuery table. The purpose of the Dataflow job is to be able to create tables with different schemas depending on whatever query was run. I can't find a way to set up a generic RowMapper for the JdbcIO read, and was hoping there is a standard way to create a row to write to BigQuery based on the schema of the rows returned in the JdbcIO.read ResultSet.

I get the following error when I don't include the RowMapper in my query definition:

Exception in thread "main" java.lang.IllegalArgumentException: withRowMapper() is required
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
    at org.apache.beam.sdk.io.jdbc.JdbcIO$Read.expand(JdbcIO.java:810)
    at org.apache.beam.sdk.io.jdbc.JdbcIO$Read.expand(JdbcIO.java:711)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:548)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:499)
    at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:56)
    at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:192)
    at edu.mayo.mcc.aide.sqaTransfer.SqaTransfer.buildPipeline(SqaTransfer.java:81)
    at edu.mayo.mcc.aide.sqaTransfer.SqaTransfer.main(SqaTransfer.java:66)

I am trying to write based on the following setup:

PCollection<TableRow> results = pipeline
        .apply("Connect", JdbcIO.<TableRow>read()
                .withDataSourceConfiguration(buildDataSourceConfig(options, URL))
                .withQuery(query)
                .withRowMapper("WHAT NEEDS TO BE HERE TO CREATE A GENERIC ROW MAPPER"));

results.apply("Write to BQ",
        BigQueryIO.writeTableRows()
                .to(dataset)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

For the read part, you can use TableRow with JdbcIO as follows:

PCollection<TableRow> results = pipeline
        .apply("Connect", JdbcIO.<TableRow>read()
                .withDataSourceConfiguration(buildDataSourceConfig(options, URL))
                .withQuery(query)
                .withCoder(TableRowJsonCoder.of())
                .withRowMapper(new JdbcIO.RowMapper<TableRow>() {
                    @Override
                    public TableRow mapRow(ResultSet resultSet) throws Exception {
                        TableRow tableRow = new TableRow();
                        // Implement your logic here: populate tableRow from resultSet.
                        return tableRow;
                    }
                })
        );
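
If you want the mapper to be fully generic, one option is to build the TableRow from the ResultSet metadata. The sketch below is only an illustration (the genericRowMapper name is mine, and the plain getObject() conversion is an assumption you may need to refine for dates, timestamps and binary columns): it copies every column into the TableRow, keyed by its column label.

import com.google.api.services.bigquery.model.TableRow;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import org.apache.beam.sdk.io.jdbc.JdbcIO;

// Sketch of a generic mapper: copies every column of the ResultSet into a TableRow,
// using the column label as the field name.
JdbcIO.RowMapper<TableRow> genericRowMapper = new JdbcIO.RowMapper<TableRow>() {
    @Override
    public TableRow mapRow(ResultSet resultSet) throws Exception {
        ResultSetMetaData meta = resultSet.getMetaData();
        TableRow row = new TableRow();
        for (int i = 1; i <= meta.getColumnCount(); i++) {
            // getObject() is a simplification; some SQL types may need an explicit
            // conversion so that TableRowJsonCoder can serialize the value.
            row.set(meta.getColumnLabel(i), resultSet.getObject(i));
        }
        return row;
    }
};

You would then pass it with .withRowMapper(genericRowMapper) next to .withCoder(TableRowJsonCoder.of()) as in the snippet above.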

I think Dataflow isn't the easiest service for this need, because you have to pass a schema to BigQueryIO so that Dataflow can create the table with the CREATE_IF_NEEDED option.

The schema is passed with the withSchema(schema) method, for example:

  rows.apply(
        "Write to BigQuery",
        BigQueryIO.writeTableRows()
            .to(String.format("%s:%s.%s", project, dataset, table))
            .withSchema(schema)
            // For CreateDisposition:
            // - CREATE_IF_NEEDED (default): creates the table if it doesn't exist, a schema is
            // required
            // - CREATE_NEVER: raises an error if the table doesn't exist, a schema is not needed
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            // For WriteDisposition:
            // - WRITE_EMPTY (default): raises an error if the table is not empty
            // - WRITE_APPEND: appends new rows to existing rows
            // - WRITE_TRUNCATE: deletes the existing rows before writing
            .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));

    // pipeline.run().waitUntilFinish();

That means that in your case, you would have to build the schema from the TableRows in the PCollection.

That's not easy to do with Beam. It would be much easier if you could use a known schema for your BigQuery table.
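
If you want to stay in Dataflow anyway, one possible workaround is to read the query's metadata once, outside the pipeline, and build the TableSchema to pass to withSchema(). This is just a sketch, not a tested implementation: the JdbcSchemaHelper class and its methods are mine, and the JDBC-to-BigQuery type mapping is deliberately minimal.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSetMetaData;
import java.sql.Types;
import java.util.ArrayList;
import java.util.List;

// Sketch: derive a BigQuery TableSchema from the query's metadata before the pipeline runs.
public class JdbcSchemaHelper {

    public static TableSchema schemaFromQuery(String jdbcUrl, String user, String password, String query)
            throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement stmt = conn.prepareStatement(query)) {
            ResultSetMetaData meta = stmt.getMetaData();
            List<TableFieldSchema> fields = new ArrayList<>();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                fields.add(new TableFieldSchema()
                        .setName(meta.getColumnLabel(i))
                        .setType(toBigQueryType(meta.getColumnType(i))));
            }
            return new TableSchema().setFields(fields);
        }
    }

    // Deliberately minimal mapping from java.sql.Types to BigQuery types.
    private static String toBigQueryType(int sqlType) {
        switch (sqlType) {
            case Types.TINYINT:
            case Types.SMALLINT:
            case Types.INTEGER:
            case Types.BIGINT:
                return "INTEGER";
            case Types.REAL:
            case Types.FLOAT:
            case Types.DOUBLE:
            case Types.NUMERIC:
            case Types.DECIMAL:
                return "FLOAT";
            case Types.BIT:
            case Types.BOOLEAN:
                return "BOOLEAN";
            case Types.DATE:
                return "DATE";
            case Types.TIMESTAMP:
                return "TIMESTAMP";
            default:
                return "STRING"; // fall back to STRING for everything else
        }
    }
}

You could then call withSchema(schemaFromQuery(...)) on the BigQueryIO.writeTableRows() shown above. Whether getMetaData() returns column information before the statement is executed depends on the JDBC driver; if it doesn't, you can run the query with a restrictive clause (e.g. TOP 1) and read the metadata from the ResultSet instead.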

If you don't want to go down that route, I propose another approach:

  • Export the MS-SQL data to Cloud Storage
  • Import the Cloud Storage files to BigQuery with autodetect mode to infer the schema from the file.

You can put this logic in a shell script if you want to automate the process with the gcloud/bq CLI.

Example of loading a GCS file into BigQuery:

 bq load \
    --autodetect \
    --replace \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata.csv
