
Apache Flink: Add side inputs for DataStream API

In my Java application, I have three DataStreams. For one stream, data is consumed from Kafka; for another, data is consumed from Apache NiFi. The two streams carry different object types: Stream-1 carries Person objects and Stream-2 carries Address objects.

The third one is the broadcast stream, whose data is also consumed from Kafka.

Now I want to combine Stream-1 and Stream-2 in a job class and then split them again in the task's process-element method. How can I implement this?

Note: Stream-1 is the main stream and Stream-2 is the side input. The main stream continuously fetches data from Kafka. For the side input, all table data is loaded from the DB when the application starts, and new data is then read whenever the table is updated (which is infrequent).

Sample structure:

DataStream<Person> stream1 = env.addSource(/* read data from Kafka */)....
DataStream<Address> stream2 = env.addSource(/* read data from NiFi */)....
BroadcastStream<String> broadcastStream = stream3.broadcast(/* read data from Kafka */);

I was referred to the following links:

FLIP-17 Side Inputs for DataStream API

jira/browse/FLINK-6131

My use case is:

Join a stream with slowly evolving data: the side input that we use for enrichment evolves over time (the data is read from a DB). This can be done by waiting for some initial data to be available before processing the main input, and then continuously ingesting new data into the internal side input structure as it arrives.
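The "wait for initial data, then keep ingesting" pattern described above can be sketched independently of Flink. What follows is only a plain-Java model of the logic, not Flink API: the class name `SideInputEnricher`, its method names, the string key, and the `"unknown"` fallback are all assumptions made for illustration. In an actual Flink job, comparable logic would more likely live in a two-input function, with the buffer and the lookup table kept in Flink state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of a side input; hypothetical names, not Flink API. */
final class SideInputEnricher {
    // Latest side-input value per key (for example, an address per person id).
    private final Map<String, String> sideInput = new HashMap<>();
    // Main-input elements that arrived before the initial load finished.
    private final List<String[]> pending = new ArrayList<>();
    private boolean initialLoadDone = false;

    /** Ingests one side-input record; a later record overwrites an earlier one. */
    void onSideInput(String key, String value) {
        sideInput.put(key, value);
    }

    /** Marks the initial load as complete and enriches everything buffered so far. */
    List<String> onInitialLoadComplete() {
        initialLoadDone = true;
        List<String> out = new ArrayList<>();
        for (String[] element : pending) {
            out.add(enrich(element[0], element[1]));
        }
        pending.clear();
        return out;
    }

    /** Handles one main-input element, buffering it until the side input is ready. */
    List<String> onMainInput(String key, String payload) {
        List<String> out = new ArrayList<>();
        if (!initialLoadDone) {
            pending.add(new String[] {key, payload});
            return out; // nothing is emitted yet
        }
        out.add(enrich(key, payload));
        return out;
    }

    // Joins the payload with the current side-input value for its key.
    private String enrich(String key, String payload) {
        return payload + "@" + sideInput.getOrDefault(key, "unknown");
    }
}
```

Main-input elements that arrive early are held back rather than dropped, and later side-input updates simply overwrite the lookup entry, which matches the "slowly evolving" case where only the latest value matters.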

Based on the latest response, the recommendation by @Arvid was in fact what was needed here.

Core of the answer:

You can easily join stream1 and stream2 even if they have different types. Then you can add the broadcast to the result.

Links to the doc and an example, and a relevant snippet from the doc (the example is too long to be included here):

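import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
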
...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
