
Apache Beam + Dataflow too slow for only 18k data

We need to run a heavy calculation on simple but numerous data.
Input data are rows in a BigQuery table with two columns: ID (INTEGER) and DATA (STRING). The DATA values are of the form "1#2#3#4#..." with 36 values.
Output data have the same form, but the DATA values are transformed by an algorithm.
It's a "one for one" transformation.
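For example, each row's DATA round-trips through a parse and re-serialize like this (a minimal sketch of the format only, with a hypothetical class name; the real algorithm lives in myService.calculate, shown further down):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DataFormatExample {
    public static void main(String[] args) {
        String data = "1#2#3";  // real rows carry 36 '#'-separated values
        // Parse the DATA string into doubles...
        List<Double> values = Arrays.stream(data.split("#"))
                .map(Double::valueOf)
                .collect(Collectors.toList());
        // ...the algorithm transforms the values row by row...
        // ...and the result is serialized back in the same form.
        String output = values.stream()
                .map(String::valueOf)
                .collect(Collectors.joining("#"));
        System.out.println(output);  // 1.0#2.0#3.0
    }
}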

We have tried Apache Beam with Google Cloud Dataflow, but it does not work: errors appear as soon as several workers are instantiated.
For our POC we use only 18k input rows; the target is about 1 million.

Here is a light version of the class (I've removed the write part, the behaviour remains the same):

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class MyClass {

// Shared service; static fields are not serialized with the DoFn and are re-created in each worker JVM
static MyService myService = new MyService();

static class ExtractDataFn extends DoFn<TableRow, KV<Long, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        Long id = Long.parseLong((String) c.element().get("ID"));  
        String data = (String) c.element().get("DATA");         
        c.output(KV.of(id, data));
    }
}

public interface Options extends PipelineOptions {
    String getInput();
    void setInput(String value);

    @Default.Enum("EXPORT")
    TypedRead.Method getReadMethod();
    void setReadMethod(TypedRead.Method value);

    @Validation.Required
    String getOutput();
    void setOutput(String value);
}

static void run(Options options) {
    Pipeline p = Pipeline.create(options);

    // Schema of the output table, used by the write step removed from this light version
    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("ID").setType("INTEGER"));
    fields.add(new TableFieldSchema().setName("DATA").setType("STRING"));
    TableSchema schema = new TableSchema().setFields(fields);

    PCollection<TableRow> rowsFromBigQuery = p.apply(
            BigQueryIO.readTableRows().from(options.getInput()).withMethod(options.getReadMethod())
    );              
    
    PCollection<KV<Long, String>> inputdata = rowsFromBigQuery.apply(ParDo.of(new ExtractDataFn()));
    PCollection<KV<Long, String>> outputData = applyTransform(inputdata);
    // Here goes the part where data are written to a BQ table (a sketch is given after the class)
    p.run().waitUntilFinish();
}

static PCollection<KV<Long, String>> applyTransform(PCollection<KV<Long, String>> inputData) {
    PCollection<KV<Long, String>> forecasts = inputData.apply(ParDo.of(new DoFn<KV<Long, String>, KV<Long, String>>() {

        @ProcessElement
        public void processElement(@Element KV<Long, String> element, OutputReceiver<KV<Long, String>> receiver) {
            MyDto dto = new MyDto();
            // Parse the '#'-separated DATA string into 36 doubles
            List<Double> inputValues = Arrays.stream(element.getValue().split("#"))
                    .map(Double::valueOf)
                    .collect(Collectors.toList());
            dto.setInputData(inputValues);
            dto = myService.calculate(dto); // here is the time-consuming operation
            // Serialize back to the same '#'-separated form as the input
            String modifiedData = dto.getModifiedData().stream()
                    .map(Object::toString)
                    .collect(Collectors.joining("#"));
            receiver.output(KV.of(element.getKey(), modifiedData));
        }
    }));
    return forecasts;
}

public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    run(options);
}

}
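For reference, the removed write part would look roughly like this at the end of run(), where outputData, options, and schema are in scope. This is a sketch only; the dispositions are assumptions, but it writes with batch loads, which matches the BigQueryIO.Write/BatchLoads steps in the logs below.

outputData
    .apply("ToTableRow", ParDo.of(new DoFn<KV<Long, String>, TableRow>() {
        @ProcessElement
        public void processElement(@Element KV<Long, String> element, OutputReceiver<TableRow> receiver) {
            // Rebuild a TableRow matching the ID/DATA schema defined above
            receiver.output(new TableRow()
                    .set("ID", element.getKey())
                    .set("DATA", element.getValue()));
        }
    }))
    .apply(BigQueryIO.writeTableRows()
            .to(options.getOutput())
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));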

In the GCP Logs console we can see the number of workers increase to 10 over about 5 minutes, then drop back to 3 or 4; after that we get several hundred messages of this kind, with CPU usage around 0%:

Proposing dynamic split of work unit myproject;2020-10-06_06_18_27-12689839210406435299;1231063355075246317 at {"fractionConsumed":0.5,"position":{"shufflePosition":"f_8A_wD_AAAB"}}

and

Operation ongoing in step BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/Read for at least 05m00s without outputting or completing in state read-shuffle at app//org.apache.beam.runners.dataflow.worker.ApplianceShuffleReader.readIncludingPosition(Native Method)

If we let it run, it eventually fails with this kind of error:

Error message from worker: java.lang.RuntimeException: unexpected org.apache.beam.runners.dataflow.worker.util.common.worker.CachingShuffleBatchReader.read(CachingShuffleBatchReader.java:77)

If I modify the myService.calculate method to be faster, all the data are processed by a single worker and there is no problem. The problem seems to occur only when the processing is parallelized across several workers.

Thank you for your help

The solution was to configure the firewall by adding a rule allowing communication between the Dataflow workers. With several workers, shuffle data must travel between worker VMs; when that traffic is blocked, the shuffle reads stall, which matches the read-shuffle messages above.

https://cloud.google.com/dataflow/docs/guides/routes-firewall
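On a network that lacks such a rule, the fix from that guide boils down to a command along these lines (a sketch; the rule name and NETWORK_NAME are placeholders, dataflow is the network tag Dataflow assigns to its worker VMs, and TCP ports 12345-12346 carry the inter-worker shuffle traffic):

gcloud compute firewall-rules create allow-dataflow-worker-communication \
    --network=NETWORK_NAME \
    --action=allow \
    --direction=ingress \
    --target-tags=dataflow \
    --source-tags=dataflow \
    --rules=tcp:12345-12346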
