
Running an external library with Cloud Dataflow

I'm trying to run some external shared library functions with Cloud Dataflow, similar to what is described here: Running external libraries with Cloud Dataflow for grid-computing workloads.

I have a couple of questions about this approach.

There is the following passage in the article mentioned earlier:

In the case of making a call to an external library, you need to do this step manually for that library. The approach is to:

  • Store the code (along with versioning information) in Cloud Storage; this removes any concerns about throughput if running 10,000s of cores in the flow.
  • In the @beginBundle [sic] method, create a synchronized block to check if the file is available on the local resource. If not, use the Cloud Storage client library to pull the file across.
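
In code, the approach from the quoted passage might look roughly like the following sketch. The bucket name, object name, and target path are hypothetical placeholders, and the download uses the Cloud Storage client library for Java:

import java.nio.file.Path;
import java.nio.file.Paths;

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class LibraryStager {

  private static final Object LOCK = new Object();
  private static volatile boolean staged = false;

  // Downloads the .so once per worker JVM; subsequent calls are no-ops.
  static Path stageLibrary() {
    Path target = Paths.get("/tmp/liblookup.so"); // hypothetical worker-local path
    synchronized (LOCK) {
      if (!staged) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        // Hypothetical bucket and versioned object name.
        Blob blob = storage.get(BlobId.of("my-bucket", "libs/liblookup-1.0.so"));
        blob.downloadTo(target);
        staged = true;
      }
    }
    return target;
  }
}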

However, with my Java package, I simply put the library's .so file into the src/main/resources/linux-x86-64 directory and call the library functions as follows (stripped to a bare minimum for brevity):

import com.sun.jna.Library;
import com.sun.jna.Native;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class HostLookupPipeline {

  public interface LookupLibrary extends Library {
    String Lookup(String domain);
  }

  static class LookupFn extends DoFn<String, KV<String, String>> {
    private static LookupLibrary lookup;

    @StartBundle
    public void startBundle() {
      // Loads liblookup.so, bundled at src/main/resources/linux-x86-64/liblookup.so
      lookup = (LookupLibrary) Native.loadLibrary("lookup", LookupLibrary.class);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      String domain = c.element();
      String results = lookup.Lookup(domain);
      if (results != null) {
        c.output(KV.of(domain, results));
      }
    }
  }
}

Is such an approach considered acceptable, or does extracting the .so file from the JAR perform poorly compared to downloading it from GCS? If not, where should I put the file after downloading it so that it is accessible to the Cloud Dataflow worker?

I've noticed that the transformation calling the external library function works rather slowly, about 90 elements/s, while utilizing 15 Cloud Dataflow workers (autoscaling, default max workers). If my rough calculations are correct, it should be twice as fast. I suppose that's because I call the external library function for every element.

Are there any best practices for improving external library performance when running with Java?

The guidance in that blog post is slightly incorrect: a much better place to put the initialization code is the @Setup method, not @StartBundle.

@Setup is called to initialize an instance of your DoFn in every thread on every worker that will be executing it. It is the intended place for heavy setup code. Its counterpart is @Teardown .
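
As a sketch of what that looks like, here is the DoFn from the question reworked to load the library in @Setup (untested; the transient modifier is my addition, so the native handle is not serialized along with the DoFn, and it reuses the question's LookupLibrary interface and imports):

static class LookupFn extends DoFn<String, KV<String, String>> {
  private transient LookupLibrary lookup;

  @Setup
  public void setup() {
    // Runs once per DoFn instance, before any bundles are processed.
    lookup = (LookupLibrary) Native.loadLibrary("lookup", LookupLibrary.class);
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    String domain = c.element();
    String results = lookup.Lookup(domain);
    if (results != null) {
      c.output(KV.of(domain, results));
    }
  }
}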

@StartBundle and @FinishBundle have a much finer granularity: they run per bundle, which is a quite low-level concept, and I believe the only common legitimate use for them is writing batches of elements to an external service. Typically, you would initialize the next batch in @StartBundle and flush it in @FinishBundle.
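
As an illustration of that batching pattern (the BatchClient below is a hypothetical external-service client, not a real API):

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

static class BatchingWriteFn extends DoFn<String, Void> {
  private transient List<String> batch;

  @StartBundle
  public void startBundle() {
    batch = new ArrayList<>(); // start a fresh batch for this bundle
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    batch.add(c.element()); // buffer instead of calling the service per element
  }

  @FinishBundle
  public void finishBundle() {
    BatchClient.write(batch); // hypothetical: one call flushes the whole bundle
  }
}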

Generally, to debug the performance, try adding logging to your DoFn's methods and see how many milliseconds the calls take and how that compares against your expectations. If you get stuck, include a Dataflow job ID in the question and an engineer will take a look at it.
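
For instance, a sketch of that kind of timing instrumentation using SLF4J (the 100 ms threshold is an arbitrary illustration):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Inside LookupFn from the question:
private static final Logger LOG = LoggerFactory.getLogger(LookupFn.class);

@ProcessElement
public void processElement(ProcessContext c) {
  long start = System.currentTimeMillis();
  String results = lookup.Lookup(c.element());
  long elapsedMs = System.currentTimeMillis() - start;
  if (elapsedMs > 100) {
    LOG.info("Lookup({}) took {} ms", c.element(), elapsedMs);
  }
  if (results != null) {
    c.output(KV.of(c.element(), results));
  }
}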
