
How to pass Table schema from Source to Sink using Apache Beam?

I have a use case where I need to load thousands of tables from Oracle to BigQuery using Apache Beam (Dataflow). I have written the code below, which works when I create the tables manually and use CreateDisposition.CREATE_NEVER, but creating all the tables by hand is not feasible. So I wrote code to fetch the schema from the source ( JdbcIO ) and pass it to BigQuery writeTableRows() .

However, the code fails with the error below.

Exception in thread "main" java.lang.IllegalArgumentException: schema can not be null
        at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
        at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.withSchema(BigQueryIO.java:2256)
        at org.example.Main.main(Main.java:109)

Code

package org.example;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;


public class Main {
    private static final Logger LOG = LoggerFactory.getLogger(Main.class);
    public static TableSchema schema;

    public static void main(String[] args) {
     

        // Read from JDBC
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

        String query2= "select * from Test.emptable";
        PCollection<TableRow> rows = p.apply(JdbcIO.<TableRow>read()
                .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                                "oracle.jdbc.OracleDriver", "jdbc:oracle:thin:@//localhost:1521/ORCL")
                        .withUsername("root")
                        .withPassword("password"))
                .withQuery(query2)
                .withCoder(TableRowJsonCoder.of())
                .withRowMapper(new JdbcIO.RowMapper<TableRow>() {
                    @Override
                    public TableRow mapRow(ResultSet resultSet) throws Exception {
                        schema = getSchemaFromResultSet(resultSet);
                        TableRow tableRow = new TableRow();

                        List<TableFieldSchema> columnNames = schema.getFields();
                        for(int i =1; i<= resultSet.getMetaData().getColumnCount(); i++) {
                            
                            tableRow.put(columnNames.get(i-1).get("name").toString(), String.valueOf(resultSet.getObject(i)));
                        }

                        return tableRow;
                        
                    }
                })
        );
       
        rows.apply(BigQueryIO.writeTableRows()
                .to("project:SampleDataset.emptable")
                .withSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        );

        p.run().waitUntilFinish();

    }
    
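    // FieldSchemaListBuilder (not shown here) is a helper class that collects
    // TableFieldSchema entries and builds a TableSchema from them.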
    private static TableSchema getSchemaFromResultSet(ResultSet resultSet) {
        FieldSchemaListBuilder fieldSchemaListBuilder = new FieldSchemaListBuilder();
        try {
            ResultSetMetaData rsmd = resultSet.getMetaData();

            for(int i=1; i <= rsmd.getColumnCount(); i++) {
                fieldSchemaListBuilder.stringField(resultSet.getMetaData().getColumnName(i));
            }
        }
        catch (SQLException ex) {
            LOG.error("Error getting metadata: " + ex.getMessage());
        }

        return fieldSchemaListBuilder.schema();
    }
}

I have tried initializing schema with a dummy schema to get past this error, expecting it to be replaced by the real schema later, but that creates the table with the dummy schema, not the actual one.

Can someone help me understand where my flow goes wrong, and how I can get the schema from JdbcIO and pass it to the BigQuery sink?

The value passed to withSchema(...) is captured when the pipeline graph is constructed, before any RowMapper has run, so your static schema field is still null at that point. To load a schema within the pipeline itself, as you're suggesting here, you can use BigQueryIO.write() and specify withSchemaFromView . In that case, you'd need to fetch the schema from the source database and wrap it in a PCollectionView (see "Side inputs" in the Beam programming guide).
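
Roughly, that could look like the sketch below (untested). Here fetchSchemaJson() is a hypothetical helper that reads the column metadata over plain JDBC outside the pipeline and returns it as a JSON-formatted TableSchema string; the map key must match the table spec you pass to to().

// Additional imports needed for the side-input approach:
// org.apache.beam.sdk.transforms.Create, org.apache.beam.sdk.transforms.View,
// org.apache.beam.sdk.values.KV, org.apache.beam.sdk.values.PCollectionView, java.util.Map

// Map of table spec -> JSON-formatted TableSchema, materialized as a side input.
PCollectionView<Map<String, String>> schemaView = p
        .apply("SchemaPerTable",
                Create.of(KV.of("project:SampleDataset.emptable", fetchSchemaJson())))
        .apply(View.asMap());

rows.apply(BigQueryIO.writeTableRows()
        .to("project:SampleDataset.emptable")
        .withSchemaFromView(schemaView)  // schema looked up per table spec at execution time
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));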

You're using the Storage Write API, which likely requires the schema to be specified explicitly. Note that the BigQuery API for file loads can infer the schema from the file contents at load time, although I'm not completely sure whether Beam supports this. I would encourage you to try file loads and setting withSchemaUpdateOptions(...) with SchemaUpdateOption.ALLOW_FIELD_ADDITION to see if that leads to the table creation behavior you're looking for.
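
For example, a sketch of that variant (untested; withSchemaUpdateOptions takes a set of options, and CREATE_IF_NEEDED will generally still expect a schema or schema view to be supplied):

// Additional import needed: java.util.EnumSet
rows.apply(BigQueryIO.writeTableRows()
        .to("project:SampleDataset.emptable")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withSchemaUpdateOptions(
                EnumSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION)));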
