
JAVA - Apache BEAM - GCP: GroupByKey works fine with Direct Runner but fails with Dataflow runner

I tested my code with the Dataflow runner, but it returns the following error:

> Error message from worker: java.lang.RuntimeException:
> org.apache.beam.sdk.util.UserCodeException:
> com.fasterxml.jackson.core.JsonParseException: Unrecognized token
> 'WindowReiterable[ ] 
> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:114)
> org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:411)
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:380)
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:305)
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:748) Caused by:
> org.apache.beam.sdk.util.UserCodeException:
> com.fasterxml.jackson.core.JsonParseException: Unrecognized token
> 'WindowReiterable': was expecting ('true', 'false' or 'null') at
> [Source: (String)"WindowReiterable []
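The token the parser chokes on is a clue: calling String.valueOf on the grouped Iterable just invokes its toString(). With the Direct Runner the backing collection's toString() happens to look like a JSON array, while Dataflow's internal WindowReiterable prepends its class name, which Jackson cannot parse. A minimal stdlib sketch of the difference (the exact Dataflow string shape is an assumption reconstructed from the error text):

```java
import java.util.List;

public class IterableToStringDemo {
    public static void main(String[] args) {
        // Each grouped value is already a JSON object string.
        List<String> grouped = List.of("{\"col1\":\"a\"}", "{\"col1\":\"b\"}");

        // Direct Runner: toString() on an ArrayList-backed collection yields
        // "[{...}, {...}]", which Jackson happens to accept as a JSON array.
        String directRunnerView = String.valueOf(grouped);

        // Dataflow: the internal Iterable's toString() prepends its class
        // name (assumed shape, based on the error message), so Jackson
        // fails on the token 'WindowReiterable'.
        String dataflowView = "WindowReiterable " + grouped;

        System.out.println(directRunnerView);
        System.out.println(dataflowView);
    }
}
```

This is why the same code "works" on the Direct Runner: it relies on an accidental toString() format rather than real serialization.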

Note that I used the same code with the Direct Runner and it works just fine. Has anyone ever encountered this issue? If so, can you please tell me how to solve it? Or should I replace the GroupByKey with another function?

Here is the code:

PCollection<KV<String, Iterable<String>>> KVElements =
        pipeline.apply("Reads the input fixed-width file", TextIO
                .read()
                .from(options.getPolledFile()))
        .apply("Converts to KV elements", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String element = c.element();
                String[] columns = ("key;col1;col2;col3").split(";");
                String[] values = element.split(";");
                ObjectNode rowToJson = jsonParser.createObjectNode();
                for (int i = 0; i < columns.length; i++) {
                    rowToJson.put(columns[i], values[i].trim());
                }
                c.output(KV.of(rowToJson.get("key").asText(), rowToJson.toString()));
            }
        }));

PCollection<KV<String, Iterable<String>>> joinedCollection = KVElements.apply(GroupByKey.create());

PCollection<String> joined = joinedCollection.apply("Converts to json string", ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        KV<String, Iterable<String>> element = c.element();
        JsonNode parsed = jsonParser.readTree(String.valueOf(element.getValue()));
        final ObjectMapper mapper = new ObjectMapper();
        ObjectNode KVJson = mapper.createObjectNode();

        for (int i = 0; i < parsed.size(); i++) {
            KVJson.put("col1", parsed.get(i).get("col1"));
            KVJson.put("col2", parsed.get(i).get("col2"));
            KVJson.put("col3", parsed.get(i).get("col3"));
        }

        c.output(KVJson.toString());
    }
}));

Version of Apache Beam: 2.17.0

Looks like the ParDo is not defined correctly. In the code snippet

"Converts to KV elements", ParDo.of(new DoFn<String, String>

the DoFn's output type should be changed to match the KV result that is being generated as output. Since c.output(KV.of(..., rowToJson.toString())) emits a KV<String, String> (the Iterable<String> only appears after the GroupByKey), both the KVElements declaration and the DoFn should use KV<String, String>, something like below:

"Converts to KV elements", ParDo.of(new DoFn<String, KV<String, String>>
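Separately, the JsonParseException itself comes from the second DoFn, where String.valueOf(element.getValue()) stringifies the grouped Iterable. A sketch that instead iterates the grouped values and parses each JSON string individually (not the accepted fix above, just an illustration; "mapper" stands in for the asker's Jackson ObjectMapper):

```java
// Sketch: parse each grouped value on its own instead of calling
// String.valueOf on the whole Iterable, whose toString() is
// runner-dependent and not guaranteed to be valid JSON.
PCollection<String> joined = joinedCollection.apply("Converts to json string",
        ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws IOException {
                ObjectMapper mapper = new ObjectMapper();
                ObjectNode kvJson = mapper.createObjectNode();
                for (String row : c.element().getValue()) {
                    JsonNode parsed = mapper.readTree(row); // one object at a time
                    kvJson.set("col1", parsed.get("col1"));
                    kvJson.set("col2", parsed.get("col2"));
                    kvJson.set("col3", parsed.get("col3"));
                }
                c.output(kvJson.toString());
            }
        }));
```

Note this keeps the original behavior of overwriting col1/col2/col3 on each iteration, so only the last grouped row survives; whether that is intended is for the asker to decide.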

