
Apache Beam + Dataflow too slow for only 18k data

We need to execute a heavy calculation on simple but numerous data.
Input data are rows in a BigQuery table, with two columns: ID (INTEGER) and DATA (STRING). The DATA values are of the form "1#2#3#4#..." with 36 values.
Output data have the same form, but the DATA values are transformed by an algorithm.
It's a "one for one" transformation.

We have tried Apache Beam with Google Cloud Dataflow, but it does not work: errors appear as soon as several workers are instantiated.
For our POC we use only 18k input rows; the target is about 1 million.

Here is a light version of the class (I've removed the write part; the behaviour remains the same):

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MyClass {

static MyService myService = new MyService();

static class ExtractDataFn extends DoFn<TableRow, KV<Long, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        Long id = Long.parseLong((String) c.element().get("ID"));  
        String data = (String) c.element().get("DATA");         
        c.output(KV.of(id, data));
    }
}

public interface Options extends PipelineOptions {
    String getInput();
    void setInput(String value);

    @Default.Enum("EXPORT")
    TypedRead.Method getReadMethod();
    void setReadMethod(TypedRead.Method value);

    @Validation.Required
    String getOutput();
    void setOutput(String value);
}

static void run(Options options) {
    Pipeline p = Pipeline.create(options);

    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("ID").setType("INTEGER"));
    fields.add(new TableFieldSchema().setName("DATA").setType("STRING"));
    TableSchema schema = new TableSchema().setFields(fields);

    PCollection<TableRow> rowsFromBigQuery = p.apply(
            BigQueryIO.readTableRows().from(options.getInput()).withMethod(options.getReadMethod())
    );              
    
    PCollection<KV<Long, String>> inputdata = rowsFromBigQuery.apply(ParDo.of(new ExtractDataFn()));
    PCollection<KV<Long, String>> outputData = applyTransform(inputdata);
    // Here goes the part where data are written in a BQ table
    p.run().waitUntilFinish();
}

static PCollection<KV<Long, String>> applyTransform(PCollection<KV<Long, String>> inputData) {      
    PCollection<KV<Long, String>> forecasts = inputData.apply(ParDo.of(new DoFn<KV<Long, String>, KV<Long, String>> () {
                    
        @ProcessElement
        public void processElement(@Element KV<Long, String> element, OutputReceiver<KV<Long, String>> receiver, ProcessContext c) {
            MyDto dto = new MyDto();
            List<Double> inputData = Arrays.asList(element.getValue().split("#")).stream().map(Double::valueOf).collect(Collectors.toList());
            dto.setInputData(inputData);                
            dto = myService.calculate(dto); // here is the time consuming operation
            String modifiedData = dto.getModifiedData().stream().map(Object::toString).collect(Collectors.joining(","));
            receiver.output(KV.of(element.getKey(), modifiedData));
        }
      }))
    ;
    return forecasts;
}

public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    run(options);
}

}
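
For context, the write part I removed (the step that shows up as BigQueryIO.Write/BatchLoads in the log messages below) has roughly the following shape. This is only a sketch of a typical batch-load write, not the exact code; the step names and the create/write dispositions here are illustrative:

// Sketch of the omitted write step inside run(): convert each KV back to a
// TableRow and load it into the table given by --output, using the schema
// built above. Dispositions are illustrative, not the actual settings.
outputData
    .apply("ToTableRow", ParDo.of(new DoFn<KV<Long, String>, TableRow>() {
        @ProcessElement
        public void processElement(@Element KV<Long, String> element, OutputReceiver<TableRow> out) {
            out.output(new TableRow()
                    .set("ID", element.getKey())
                    .set("DATA", element.getValue()));
        }
    }))
    .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to(options.getOutput())
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));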

In the GCP Logs console we can see the number of workers increase to 10 over about 5 minutes, then drop to 3 or 4, and then we get this kind of message (several hundred of them) while CPU usage is around 0%:

Proposing dynamic split of work unit myproject;2020-10-06_06_18_27-12689839210406435299;1231063355075246317 at {"fractionConsumed":0.5,"position":{"shufflePosition":"f_8A_wD_AAAB"}}

and

Operation ongoing in step BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/Read for at least 05m00s without outputting or completing in state read-shuffle at app//org.apache.beam.runners.dataflow.worker.ApplianceShuffleReader.readIncludingPosition(Native Method)

If we let it run, it eventually fails with this kind of error:

Error message from worker: java.lang.RuntimeException: unexpected org.apache.beam.runners.dataflow.worker.util.common.worker.CachingShuffleBatchReader.read(CachingShuffleBatchReader.java:77)

If I modify the myService.calculate method to be faster, all the data are processed by a single worker and there is no problem. The problem only seems to occur when the processing is parallelized.

Thank you for your help.

The solution was to configure the firewall by adding a rule allowing communication between workers.

https://cloud.google.com/dataflow/docs/guides/routes-firewall
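
Concretely, the fix from that guide boils down to an ingress rule that lets the Dataflow worker VMs (which carry the dataflow network tag) reach each other on the ports used for inter-worker communication. A sketch of such a rule, with a placeholder rule name and network, looks like this:

# Allow Dataflow workers on NETWORK to talk to each other on TCP 12345-12346
# (inter-worker/shuffle traffic). Rule name and NETWORK are placeholders.
gcloud compute firewall-rules create allow-dataflow-internal \
    --network=NETWORK \
    --action=allow \
    --direction=ingress \
    --source-tags=dataflow \
    --target-tags=dataflow \
    --priority=0 \
    --rules=tcp:12345-12346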
