
Handling empty PCollections with BigQuery in Apache Beam

Using the code below, I get the following error when trying to write to BigQuery.

I am using Apache Beam 2.0.0.

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException

If I change the text.startsWith("X") check to "D", everything works fine (i.e. at least one row is output).

Is there some way to catch or watch for empty PCollections?

Based on the stack trace, the error appears to be in BigQueryIO itself: the temp file left in my bucket has 0 bytes, and that may be what BigQueryIO fails on.

My use case is that I am using side outputs for dead letters, and I hit this error when my job produced no dead-letter output, so handling this robustly would be useful.

The job should really be able to run in either batch or streaming mode. My best guess is to write any output to GCS via TextIO in batch mode and to BigQuery when streaming. Does that sound sensible?

Any help gratefully received.

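import com.google.api.services.bigquery.model.TableRow;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class EmptyPCollection {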

public static void main(String [] args) {

    PipelineOptions options = PipelineOptionsFactory.create();
    options.setTempLocation("gs://<your-bucket-here>/temp");
    Pipeline pipeline = Pipeline.create(options);
    String schema = "{\"fields\": [{\"name\": \"pet\", \"type\": \"string\", \"mode\": \"required\"}]}";
    String table = "<your-dataset>.<your-table>";
    List<String> pets = Arrays.asList("Dog", "Cat", "Goldfish");
    PCollection<String> inputText = pipeline.apply(Create.of(pets)).setCoder(StringUtf8Coder.of());
    PCollection<TableRow> rows = inputText.apply(ParDo.of(new DoFn<String, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String text = c.element();
            if (text.startsWith("X")) {  // change to (D)og and works fine
                TableRow row = new TableRow();
                row.set("pet", text);
                c.output(row);
            }
        }
    }));

    rows.apply(BigQueryIO.writeTableRows().to(table).withJsonSchema(schema)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

    pipeline.run().waitUntilFinish();

}

}

[direct-runner-worker] INFO org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter - Opening TableRowWriter to gs://<your-bucket>/temp/BigQueryWriteTemp/05c7a7c0786a4656abad97f11ef23d8e/2675e1c7-f4d7-4f78-a85f-a38095b57e6b.

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:322)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:295)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:281)
at EmptyPCollection.main(EmptyPCollection.java:54)
Caused by: java.lang.NullPointerException
at org.apache.beam.sdk.io.gcp.bigquery.WriteTables.processElement(WriteTables.java:97)

This looks like a bug in the BigQuery sink implementation within Apache Beam. The Apache Beam Jira is the appropriate place to report it.

I have filed https://issues.apache.org/jira/browse/BEAM-2406 to track this issue.
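
On the "watch for empty PCollections" part of the question, one possible approach is to count the collection and inspect the count. The following is only a minimal sketch, assuming the Beam Java SDK and the DirectRunner are on the classpath; the class name EmptyPCollectionCheck and the describe helper are made up for illustration. It relies on Count.globally() emitting a single 0 for empty input in the global window. It only reports emptiness: it does not skip the BigQueryIO write, because the pipeline graph is fixed before any data is seen, so it is not a fix for the NullPointerException itself.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class EmptyPCollectionCheck {

    // Hypothetical helper (not part of Beam): turns an element count into a message.
    static String describe(long count) {
        return count == 0L
                ? "PCollection is empty"
                : "PCollection has " + count + " element(s)";
    }

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());
        List<String> pets = Arrays.asList("Dog", "Cat", "Goldfish");

        // Same filter as in the question: no pet starts with "X", so nothing is output.
        PCollection<String> matches = pipeline
                .apply(Create.of(pets).withCoder(StringUtf8Coder.of()))
                .apply("FilterPets", ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        if (c.element().startsWith("X")) {
                            c.output(c.element());
                        }
                    }
                }));

        // In the global window, Count.globally() produces one Long even for empty input.
        PCollection<Long> count = matches.apply(Count.<String>globally());

        // Report the count; a real job might log it or update a metric instead.
        count.apply("ReportCount", ParDo.of(new DoFn<Long, Void>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                System.out.println(describe(c.element()));
            }
        }));

        pipeline.run().waitUntilFinish();
    }
}
```

Note that this applies to batch jobs in the global window. With non-global windowing (typical in streaming), a global combine has to be configured differently, for example without default values, so the check above would need to be adapted.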
