
Apache Beam + Dataflow too slow for only 18k data

We need to execute a heavy calculation on simple but numerous data.
Input data are rows in a BigQuery table, with two columns: ID (INTEGER) and DATA (STRING). The DATA values are of the form "1#2#3#4#..." with 36 values.
Output data have the same form, but the DATA values are transformed by an algorithm.
It's a "one for one" transformation.

We have tried Apache Beam with Google Cloud Dataflow, but it does not work: errors appear as soon as several workers are instantiated.
For our POC we use only 18k input rows; the target is about 1 million.

Here is a light version of the class (I've removed the write part; the behaviour remains the same):

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MyClass {

static MyService myService = new MyService();

static class ExtractDataFn extends DoFn<TableRow, KV<Long, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        Long id = Long.parseLong((String) c.element().get("ID"));  
        String data = (String) c.element().get("DATA");         
        c.output(KV.of(id, data));
    }
}

public interface Options extends PipelineOptions {
    String getInput();
    void setInput(String value);

    @Default.Enum("EXPORT")
    TypedRead.Method getReadMethod();
    void setReadMethod(TypedRead.Method value);

    @Validation.Required
    String getOutput();
    void setOutput(String value);
}

static void run(Options options) {
    Pipeline p = Pipeline.create(options);

    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("ID").setType("INTEGER"));
    fields.add(new TableFieldSchema().setName("DATA").setType("STRING"));
    TableSchema schema = new TableSchema().setFields(fields);

    PCollection<TableRow> rowsFromBigQuery = p.apply(
            BigQueryIO.readTableRows().from(options.getInput()).withMethod(options.getReadMethod())
    );              
    
    PCollection<KV<Long, String>> inputdata = rowsFromBigQuery.apply(ParDo.of(new ExtractDataFn()));
    PCollection<KV<Long, String>> outputData = applyTransform(inputdata);
    // Here goes the part where data are written in a BQ table
    p.run().waitUntilFinish();
}

static PCollection<KV<Long, String>> applyTransform(PCollection<KV<Long, String>> inputData) {      
    PCollection<KV<Long, String>> forecasts = inputData.apply(ParDo.of(new DoFn<KV<Long, String>, KV<Long, String>> () {
                    
        @ProcessElement
        public void processElement(@Element KV<Long, String> element, OutputReceiver<KV<Long, String>> receiver, ProcessContext c) {
            MyDto dto = new MyDto();
            List<Double> inputData = Arrays.asList(element.getValue().split("#")).stream().map(Double::valueOf).collect(Collectors.toList());
            dto.setInputData(inputData);                
            dto = myService.calculate(dto); // here is the time consuming operation
            String modifiedData = dto.getModifiedData().stream().map(Object::toString).collect(Collectors.joining(","));
            receiver.output(KV.of(element.getKey(), modifiedData));
        }
      }))
    ;
    return forecasts;
}

public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    run(options);
}

}
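
For context, the write part I removed (the step that shows up as BigQueryIO.Write/BatchLoads in the log messages below) has roughly the following shape. This is only a sketch of a typical batch-load write, not the exact code; the step names and the create/write dispositions here are illustrative:

// Sketch of the omitted write step inside run(): convert each KV back to a
// TableRow and load it into the table given by --output, using the schema
// built above. Dispositions are illustrative, not the actual settings.
outputData
    .apply("ToTableRow", ParDo.of(new DoFn<KV<Long, String>, TableRow>() {
        @ProcessElement
        public void processElement(@Element KV<Long, String> element, OutputReceiver<TableRow> out) {
            out.output(new TableRow()
                    .set("ID", element.getKey())
                    .set("DATA", element.getValue()));
        }
    }))
    .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to(options.getOutput())
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));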

In the GCP Logs console we can see the number of workers increase to 10 over about 5 minutes, then drop to 3 or 4, and then we get this kind of message (several hundred of them) while CPU usage is around 0%:

Proposing dynamic split of work unit myproject;2020-10-06_06_18_27-12689839210406435299;1231063355075246317 at {"fractionConsumed":0.5,"position":{"shufflePosition":"f_8A_wD_AAAB"}}

and

Operation ongoing in step BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/Read for at least 05m00s without outputting or completing in state read-shuffle at app//org.apache.beam.runners.dataflow.worker.ApplianceShuffleReader.readIncludingPosition(Native Method)

If we let it run, it eventually fails with this kind of error:

Error message from worker: java.lang.RuntimeException: unexpected org.apache.beam.runners.dataflow.worker.util.common.worker.CachingShuffleBatchReader.read(CachingShuffleBatchReader.java:77)

If I modify the myService.calculate method to be faster, all the data are processed by a single worker and there is no problem. The problem only seems to occur when the processing is parallelized.

Thank you for your help.

The solution was to configure the firewall by adding a rule allowing communication between workers.

https://cloud.google.com/dataflow/docs/guides/routes-firewall
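
Concretely, the fix from that guide boils down to an ingress rule that lets the Dataflow worker VMs (which carry the dataflow network tag) reach each other on the ports used for inter-worker communication. A sketch of such a rule, with a placeholder rule name and network, looks like this:

# Allow Dataflow workers on NETWORK to talk to each other on TCP 12345-12346
# (inter-worker/shuffle traffic). Rule name and NETWORK are placeholders.
gcloud compute firewall-rules create allow-dataflow-internal \
    --network=NETWORK \
    --action=allow \
    --direction=ingress \
    --source-tags=dataflow \
    --target-tags=dataflow \
    --priority=0 \
    --rules=tcp:12345-12346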
