
Using TableRowJsonCoder to convert PubSub Message to TableRow in BEAM

I am using Dataflow 1.9 (Java API) to read Pubsub messages and stream them into BigQuery without explicitly setting each column in a TableRow. Below is the code snippet for the conversion.

PCollection<TableRow> payloadTableRow = pipeline
    .apply("Read", PubsubIO.Read.subscription(***MY_SUBSCRIPTION***)
    .withCoder(TableRowJsonCoder.of()));

The above code works perfectly: I can see that Pubsub messages from the topic get converted to PCollection<TableRow> and then written to BigQuery using BigQueryIO.Write.

When I try to do the same in Apache Beam, I cannot set the TableRowJsonCoder for a Pubsub message because Beam's PubsubIO lacks the withCoder() method. I tried setCoder() as below, but I get a compilation error. I even tried PubsubIO.readStrings(), but the error stays the same.

pipeline
    .apply("Read", PubsubIO.readMessagesWithAttributes()
        .fromSubscription(***MY_SUBSCRIPTION***))
    .setCoder(TableRowJsonCoder.of())

I can see that withCoder() exists in Dataflow 1.9, but this missing feature is keeping me from upgrading to Beam.

My questions are:

  • Does Beam's PubSubIO class have anything similar to withCoder() so that I can move to Beam?
  • If so, how can I tell PubsubIO to apply TableRowJsonCoder.of() for this implicit conversion?
  • It would be helpful to see a couple of lines of code showing the solution in Beam (Java API).

UPDATE

As Kenn Knowles rightly pointed out, I used MapElements to pull out the byte[] and then transformed it into a TableRow as below.

    PCollection<byte[]> payloadByteArray = payloadInPubSubMessage.apply(
            MapElements.via(new SimpleFunction<PubsubMessage, byte[]>() {
                @Override
                public byte[] apply(PubsubMessage input) {
                    return input.getPayload();
                }
            }));

    PCollection<TableRow> payloadTableRow = payloadByteArray.apply(
            MapElements.via(new SimpleFunction<byte[], TableRow>() {
                @Override
                public TableRow apply(byte[] input) {
                    TableRow tableRow = null;
                    try {
                        tableRow = TableRowJsonCoder.of().decode(new ByteArrayInputStream(input));
                    } catch (Exception ex) {
                        ex.printStackTrace();
                    }
                    return tableRow;
                }
            }));

Now I am getting an EOFException while transforming the byte array to a TableRow using TableRowJsonCoder.of().decode(). I suspected I was missing some sort of Coder for TableRow and registered one as below.

CoderRegistry registry = pipeline.getCoderRegistry();
registry.registerCoderForClass(TableRow.class, TableRowJsonCoder.of());

This doesn't seem to solve the issue, and I would like some insight into the error below:

Caused by: org.apache.beam.sdk.coders.CoderException: java.io.EOFException
at org.apache.beam.sdk.coders.StringUtf8Coder.decode(StringUtf8Coder.java:110)
at org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder.decode(TableRowJsonCoder.java:61)
at org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder.decode(TableRowJsonCoder.java:55)
at com.gcp.poc.transformers.TableRowTransformer.processElement(TableRowTransformer.java:48)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.beam.sdk.coders.StringUtf8Coder.readString(StringUtf8Coder.java:63)
at org.apache.beam.sdk.coders.StringUtf8Coder.decode(StringUtf8Coder.java:106)
at org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder.decode(TableRowJsonCoder.java:61)
at org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder.decode(TableRowJsonCoder.java:55)

I hope this makes sense; I would love a solution to this TableRow decoding issue.

ANSWER

In Beam, IO connectors are simplified to output their most natural type. For PubsubIO it is PubsubMessage. From there, you can perform arbitrary processing on the messages.

For your specific example, you would use PubsubIO.readMessages() followed by MapElements to pull out the byte[] payload and parse it into a TableRow.
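To make that concrete, here is a hedged sketch of the pipeline fragment in Beam's Java API. The `rows` variable name and the use of Jackson's ObjectMapper (a library Beam itself depends on) are assumptions for illustration, not the one canonical solution; the subscription placeholder is kept from the question.

```java
import java.io.IOException;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

// ...inside the pipeline definition:
PCollection<TableRow> rows = pipeline
    .apply("Read", PubsubIO.readMessages().fromSubscription(***MY_SUBSCRIPTION***))
    .apply("ToTableRow", MapElements.via(new SimpleFunction<PubsubMessage, TableRow>() {
        @Override
        public TableRow apply(PubsubMessage message) {
            try {
                // Parse the raw JSON payload into a generic map, then copy
                // the fields into a TableRow (which itself implements Map).
                @SuppressWarnings("unchecked")
                Map<String, Object> fields =
                    new ObjectMapper().readValue(message.getPayload(), Map.class);
                TableRow row = new TableRow();
                row.putAll(fields);
                return row;
            } catch (IOException e) {
                throw new RuntimeException("Payload is not valid JSON", e);
            }
        }
    }));
```

In a real pipeline you would likely hold the ObjectMapper in a static field rather than constructing one per element. If you prefer to reuse TableRowJsonCoder for the parsing itself, `TableRowJsonCoder.of().decode(new ByteArrayInputStream(message.getPayload()), Coder.Context.OUTER)` should also work: the OUTER context tells the coder the stream holds exactly one value with no length prefix.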

The TableRowJsonCoder describes how to encode/decode elements of type TableRow when passing them between points in a pipeline. Instead of calling TableRowJsonCoder.of().decode(...) within your MapElements, you should examine the bytes that you have received from Pubsub and parse them into some meaningful form. This could mean building a TableRow using its setter methods, as shown in the Beam examples, such as BigQueryTornadoes.
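To see why the decode(...) call in the update throws EOFException: in its default (nested) context, TableRowJsonCoder delegates to StringUtf8Coder, as the stack trace shows, and that coder expects a varint length prefix before the UTF-8 bytes; a raw Pubsub payload has no such prefix, so the first byte of the JSON is misread as a length. The self-contained sketch below imitates that framing; the class and method names are hypothetical, and the varint handling is simplified to single-byte lengths for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class NestedDecodeDemo {

    // Write a string the way a nested string coder does: a length prefix,
    // then the UTF-8 bytes. (One byte is enough for short strings.)
    public static byte[] encodeNested(String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(utf8.length);   // length prefix
        out.write(utf8);          // payload bytes
        return out.toByteArray();
    }

    // Read it back: length prefix first, then exactly that many bytes.
    public static String decodeNested(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int length = in.read();   // length prefix
        byte[] buf = new byte[length];
        in.readFully(buf);        // throws EOFException when bytes run short
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Round trip works when the prefix is present.
        System.out.println(decodeNested(encodeNested("{\"city\":\"LA\"}")));

        // Raw JSON has no prefix: its first byte '{' (0x7B = 123) is misread
        // as "123 bytes follow", and readFully runs off the end of the stream.
        try {
            decodeNested("{\"city\":\"LA\"}".getBytes(StandardCharsets.UTF_8));
        } catch (EOFException e) {
            System.out.println("EOFException, as in the stack trace above");
        }
    }
}
```

This is why registering TableRowJsonCoder in the CoderRegistry does not help: the registry only affects how elements are passed between pipeline stages, not how your own code interprets the payload bytes.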
