
External API call in Apache Beam Dataflow

I have a use case where I read newline-delimited JSON elements stored in Google Cloud Storage and process each one. While processing each JSON element, I have to call an external API for de-duplication, to check whether that element was discovered previously. I'm doing a ParDo with a DoFn on each JSON element.

I haven't seen any online tutorial showing how to call an external API endpoint from an Apache Beam DoFn on Dataflow.

I'm using the Java SDK of Beam. Some of the tutorials I studied mention using startBundle and finishBundle, but I'm not clear on how to use them.

If you need to check for duplicates in external storage for every JSON record, you can still use a DoFn for that. There are several annotations, such as @Setup, @StartBundle, @FinishBundle, etc., that can be used to annotate methods in your DoFn.

For example, if you need to instantiate a client object to send requests to your external database, you might want to do this in the @Setup method (much like a POJO constructor) and then use this client object in your @ProcessElement method.

Let's consider a simple example:

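// Requires: import org.apache.beam.sdk.transforms.DoFn;
// Record and MyClient are placeholders for your own record type and API client.
static class MyDoFn extends DoFn<Record, Record> {

    // Instance field rather than static, so that each DoFn instance
    // owns its client and closing it does not affect other instances.
    private transient MyClient client;

    @Setup
    public void setup() {
        // Called once per DoFn instance, before it processes any bundle.
        client = new MyClient("host");
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        Record r = c.element();
        // Emit the record only if its ID is not yet known to the external system.
        if (!client.isRecordExist(r.id())) {
            c.output(r);
        }
    }

    @Teardown
    public void teardown() {
        // Release the connection when the runner discards this instance.
        if (client != null) {
            client.close();
            client = null;
        }
    }
}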

Also, to avoid making a remote call for every record, you can collect the records of a bundle in an internal buffer (Beam splits input data into bundles) and check for duplicates in batch mode (if your client supports this). For this purpose, you can use methods annotated with @StartBundle and @FinishBundle, which are called right before and right after a Beam bundle is processed, respectively.
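The buffering logic itself does not depend on Beam, so here is a minimal sketch of it as a plain Java class. The class name BatchDedupBuffer, its methods, and the findExisting batch lookup are all hypothetical; the lookup stands in for whatever bulk "which of these already exist?" call your client offers. In a DoFn you would call start() from the @StartBundle method, add() from @ProcessElement, and flush() from @FinishBundle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical helper: buffers records and checks them for duplicates in batches. */
class BatchDedupBuffer<T> {
    private final int maxBatchSize;
    // Assumed batch lookup: given a batch, returns the records that already exist remotely.
    private final Function<List<T>, Set<T>> findExisting;
    private final List<T> buffer = new ArrayList<>();

    BatchDedupBuffer(int maxBatchSize, Function<List<T>, Set<T>> findExisting) {
        this.maxBatchSize = maxBatchSize;
        this.findExisting = findExisting;
    }

    /** Call at the start of a bundle: begin with an empty buffer. */
    void start() {
        buffer.clear();
    }

    /** Call for every element: buffer it and flush once the batch is full. */
    void add(T record, Consumer<T> out) {
        buffer.add(record);
        if (buffer.size() >= maxBatchSize) {
            flush(out);
        }
    }

    /** Call at the end of a bundle: one remote call for whatever is still buffered. */
    void flush(Consumer<T> out) {
        if (buffer.isEmpty()) {
            return;
        }
        Set<T> existing = findExisting.apply(new ArrayList<>(buffer));
        for (T record : buffer) {
            if (!existing.contains(record)) {
                out.accept(record); // not seen before, pass it downstream
            }
        }
        buffer.clear();
    }
}
```

Two caveats when wiring this into a real DoFn: output from a @FinishBundle method goes through FinishBundleContext.output(element, timestamp, window), so the buffer would also need to keep each element's timestamp and window; and this sketch does not detect two identical records inside the same batch.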

For more complicated examples, I'd recommend taking a look at the Sink implementations in different Beam IOs, such as KinesisIO.

There is an example of calling an external system in batches using a stateful DoFn in the following blog post, which might be helpful: https://beam.apache.org/blog/2017/08/28/timely-processing.html

