Handling empty PCollections with BigQuery in Apache Beam

Question

With the code below, I get the following error when trying to write to BigQuery.

I am using Apache Beam 2.0.0.

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException

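If I change the text.startsWith argument to "D", everything works fine (i.e. something is output).

Is there some way to catch or monitor empty PCollections?

Based on the stack trace, the error appears to be in BigQueryIO itself. The file left in my bucket is 0 bytes, which may be what trips up BigQueryIO.

My use case is that I am using side outputs for dead letters, and I hit this error when my job produced no dead-letter output, so handling this robustly would be useful.

The job really should be able to run in either batch or streaming mode. My best guess is to write any output to GCS / TextIO in batch mode and to BigQuery in streaming mode. Does that sound sensible?

Any help is much appreciated.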

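import com.google.api.services.bigquery.model.TableRow;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class EmptyPCollection {

    public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.create();
        // BigQueryIO stages its load files under this temp location.
        options.setTempLocation("gs://<your-bucket-here>/temp");
        Pipeline pipeline = Pipeline.create(options);
        String schema = "{\"fields\": [{\"name\": \"pet\", \"type\": \"string\", \"mode\": \"required\"}]}";
        String table = "<your-dataset>.<your-table>";
        List<String> pets = Arrays.asList("Dog", "Cat", "Goldfish");
        PCollection<String> inputText = pipeline.apply(Create.of(pets)).setCoder(StringUtf8Coder.of());
        PCollection<TableRow> rows = inputText.apply(ParDo.of(new DoFn<String, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String text = c.element();
                // No pet starts with "X", so this PCollection ends up empty.
                if (text.startsWith("X")) {  // change to (D)og and works fine
                    TableRow row = new TableRow();
                    row.set("pet", text);
                    c.output(row);
                }
            }
        }));

        // Writing the empty PCollection is what triggers the NullPointerException.
        rows.apply(BigQueryIO.writeTableRows().to(table).withJsonSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

        pipeline.run().waitUntilFinish();

    }

}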

[direct-runner-worker] INFO org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter - Opening TableRowWriter to gs://<your-bucket>/temp/BigQueryWriteTemp/05c7a7c0786a4656abad97f11ef23d8e/2675e1c7-f4d7-4f78-a85f-a38095b57e6b.

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:322)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:295)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:281)
at EmptyPCollection.main(EmptyPCollection.java:54)
Caused by: java.lang.NullPointerException
at org.apache.beam.sdk.io.gcp.bigquery.WriteTables.processElement(WriteTables.java:97)

Answer 1

This looks like a bug in the BigQuery sink implementation in Apache Beam. The Apache Beam Jira is the appropriate place to file a bug report for it.

I have filed https://issues.apache.org/jira/browse/BEAM-2406 to track this issue.
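
On the narrower question of monitoring for empty PCollections, here is a minimal sketch rather than a fix for the NullPointerException. It relies on Count.globally() emitting a single 0 when the input is empty in the global window, and prints a message from a downstream DoFn. The names CountBeforeWrite, KeepPrefixFn, LogCountFn and describe are hypothetical and only for illustration. The sketch assumes the Beam Java SDK core and a direct runner are on the classpath.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical example class; not part of Beam or of the question's code.
public class CountBeforeWrite {

    /** Builds the message reported for a given element count. */
    static String describe(long count) {
        return count == 0
                ? "PCollection is empty (0 elements)"
                : "PCollection has " + count + " element(s)";
    }

    /** Keeps only strings with the given prefix, mirroring the filter in the question. */
    static class KeepPrefixFn extends DoFn<String, String> {
        private final String prefix;

        KeepPrefixFn(String prefix) {
            this.prefix = prefix;
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            if (c.element().startsWith(prefix)) {
                c.output(c.element());
            }
        }
    }

    /** Prints the count it receives and passes it through unchanged. */
    static class LogCountFn extends DoFn<Long, Long> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            System.out.println(describe(c.element()));
            c.output(c.element());
        }
    }

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

        // No pet starts with "X", so 'matches' is empty, as in the question.
        PCollection<String> matches = pipeline
                .apply(Create.of(Arrays.asList("Dog", "Cat", "Goldfish"))
                        .withCoder(StringUtf8Coder.of()))
                .apply(ParDo.of(new KeepPrefixFn("X")));

        // Count.globally() yields one Long per window, including 0 for empty input.
        matches
                .apply(Count.<String>globally())
                .apply(ParDo.of(new LogCountFn()));

        pipeline.run().waitUntilFinish();
    }
}
```

Because the pipeline graph is fixed before it runs, this only reports the empty case; it does not by itself skip the BigQueryIO write.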

Answer 1: score 3, accepted, 2017-06-03 00:26:31
