

Flink (1.3.2) Broadcast record to every operator exactly once

I have an execution graph much like this:

{"nodes":[{"id":1,"type":"Source: AggregatedData","pact":"Data Source","contents":"Source: AggregatedData","parallelism":1},{"id":2,"type":"AddVirtualKeyFunction","pact":"Operator","contents":"AddVirtualKeyFunction","parallelism":4,"predecessors":[{"id":1,"ship_strategy":"REBALANCE","side":"second"}]},{"id":3,"type":"Source: FilterInformation","pact":"Data Source","contents":"Source: FilterInformation","parallelism":1},{"id":4,"type":"BroadcastFilterInformation","pact":"Operator","contents":"BroadcastFilterInformation","parallelism":1,"predecessors":[{"id":3,"ship_strategy":"FORWARD","side":"second"}]},{"id":7,"type":"ConnectAndApplyFilterFunction","pact":"Operator","contents":"ConnectAndApplyFilterFunction","parallelism":4,"predecessors":[{"id":2,"ship_strategy":"HASH","side":"second"},{"id":4,"ship_strategy":"HASH","side":"second"}]},{"id":8,"type":"Sink: OutputFilteredData","pact":"Data Sink","contents":"Sink: OutputFilteredData","parallelism":4,"predecessors":[{"id":7,"ship_strategy":"FORWARD","side":"second"}]}]}

(can be visualized here: https://flink.apache.org/visualizer/)

I have a stream of aggregated data ("AggregatedData", ID = 1) which needs to be filtered by some filter coming from another stream ("FilterInformation", ID = 3).

What I first did was use operator state in my "ConnectAndApplyFilterFunction" (ID = 7), which technically works fine but is limited to a parallelism of 1.

Currently, I'm using a hack: in "AddVirtualKeyFunction" I map my aggregated data to a Tuple2<Integer, AggregatedData> where the Integer (f0) is a randomly generated number from 0 to 19:

@Override
public Tuple2<Integer, AggregatedData> map(AggregatedData value) throws Exception {
    // attach a random "virtual key" in [0, virtualKeyCount), i.e. 0-19, to every record
    return new Tuple2<>(ThreadLocalRandom.current().nextInt(this.virtualKeyCount), value);
}

The "BroadcastFilterInformation" is a flatMap which publishes a Tuple2<Integer, FilterInfo> 20 Times (with f0 0-19) every time it receives a new FilterInformation: “BroadcastFilterInformation”是一个 flatMap,它在每次收到新的 FilterInformation 时发布Tuple2<Integer, FilterInfo> 20 次(f0 0-19):

@Override
public void flatMap(FilterInfo filterInfo, Collector<Tuple2<Integer, FilterInfo>> collector) throws Exception {
    // only act on filters that are newer than the one already forwarded
    if (this.currentLatestTimestamp < filterInfo.getLastUpdateTime()) {
        this.currentLatestTimestamp = filterInfo.getLastUpdateTime();

        // emit one copy per virtual key (broadcastCount = 20) so every key sees the filter
        for (int i = 0; i < this.broadcastCount; i++) {
            collector.collect(new Tuple2<>(i, filterInfo));
        }
    }
}

I now connect both streams and key them by their "virtual key" (Tuple2.f0). I keep 20 copies of my FilterInfo in keyed state in "ConnectAndApplyFilterFunction" (ID = 7).
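For illustration, the wiring could look roughly like the sketch below. This is only a sketch: the variables aggregatedStream and filterStream are placeholder names, and ConnectAndApplyFilterFunction is assumed to be a CoFlatMapFunction, as suggested by the execution plan above.

// sketch of the "virtual key" workaround (variable names are hypothetical)
DataStream<Tuple2<Integer, AggregatedData>> keyedData =
        aggregatedStream.map(new AddVirtualKeyFunction());          // attaches a random key 0-19

DataStream<Tuple2<Integer, FilterInfo>> replicatedFilter =
        filterStream.flatMap(new BroadcastFilterInformation());     // emits 20 keyed copies per new filter

keyedData
        .connect(replicatedFilter)
        .keyBy(0, 0)                                                 // key both inputs by the virtual key (f0)
        .flatMap(new ConnectAndApplyFilterFunction())                // keeps one FilterInfo copy per key in keyed state
        .addSink(new OutputFilteredData());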

This works fine, and I can use parallelism on my main path. But why do I use 20 "virtual keys" while my parallelism is only 4? Because with only 4 keys, not all operator instances will be used (2 instances were not receiving any data in my test).

Is there any way I can broadcast some data from one stream so that every operator instance on the other side receives its own copy?

You can most probably use the broadcast option to make the data available to all parallel instances of an operation.

In the case of batch processing, you can make use of broadcast variables, which the linked website describes as follows (a corresponding example can also be found there):

Broadcast variables allow you to make a data set available to all parallel instances of an operation, in addition to the regular input of the operation. This is useful for auxiliary data sets, or data-dependent parameterization. The data set will then be accessible at the operator as a Collection.
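As a rough sketch of that batch pattern (the data-set names, the "filters" broadcast name, and the filtering logic are illustrative assumptions, not taken from the question):

// Batch (DataSet API): ship the filter set to every parallel instance of the map operator
DataSet<AggregatedData> filtered = aggregatedData
        .map(new RichMapFunction<AggregatedData, AggregatedData>() {
            private List<FilterInfo> filters;

            @Override
            public void open(Configuration parameters) {
                // each parallel instance gets its own copy of the broadcast data set
                this.filters = getRuntimeContext().getBroadcastVariable("filters");
            }

            @Override
            public AggregatedData map(AggregatedData value) {
                // apply "filters" to "value" here (filter logic omitted)
                return value;
            }
        })
        .withBroadcastSet(filterInfos, "filters");                   // filterInfos is a DataSet<FilterInfo>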

In the case of stream processing, you can use datastream.broadcast() to broadcast one stream's elements to every parallel instance of the downstream operator.

According to the Flink website, the broadcast function "Broadcasts elements (from one stream) to every partition".
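Applied to the question, that could look roughly like the following sketch. It assumes the placeholder streams aggregatedStream / filterStream and a hypothetical matches() helper; the plain instance field is per-instance operator state and is not checkpointed here.

// Streaming: broadcast the filter stream so every parallel instance (parallelism 4)
// of the connected operator receives every FilterInfo, so no virtual keys are needed
aggregatedStream
        .connect(filterStream.broadcast())
        .flatMap(new CoFlatMapFunction<AggregatedData, FilterInfo, AggregatedData>() {
            private FilterInfo currentFilter;                        // one local copy per parallel instance

            @Override
            public void flatMap1(AggregatedData value, Collector<AggregatedData> out) {
                // data path: forward records that pass the latest filter (matches() is hypothetical)
                if (currentFilter == null || matches(value, currentFilter)) {
                    out.collect(value);
                }
            }

            @Override
            public void flatMap2(FilterInfo filter, Collector<AggregatedData> out) {
                // broadcast path: every instance updates its own copy of the filter
                currentFilter = filter;
            }
        })
        .addSink(new OutputFilteredData());

With .broadcast(), each FilterInfo is replicated to all parallel instances of the connected operator, so the virtual-key replication from the question is no longer necessary.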

In the stream processing scenario, keep in mind that you need to consider race conditions, as data from either stream can arrive in any order.

Sample code can be checked out here.
