

Flink (1.3.2) Broadcast record to every operator exactly once

I have an execution graph much like this:

{"nodes":[{"id":1,"type":"Source: AggregatedData","pact":"Data Source","contents":"Source: AggregatedData","parallelism":1},{"id":2,"type":"AddVirtualKeyFunction","pact":"Operator","contents":"AddVirtualKeyFunction","parallelism":4,"predecessors":[{"id":1,"ship_strategy":"REBALANCE","side":"second"}]},{"id":3,"type":"Source: FilterInformation","pact":"Data Source","contents":"Source: FilterInformation","parallelism":1},{"id":4,"type":"BroadcastFilterInformation","pact":"Operator","contents":"BroadcastFilterInformation","parallelism":1,"predecessors":[{"id":3,"ship_strategy":"FORWARD","side":"second"}]},{"id":7,"type":"ConnectAndApplyFilterFunction","pact":"Operator","contents":"ConnectAndApplyFilterFunction","parallelism":4,"predecessors":[{"id":2,"ship_strategy":"HASH","side":"second"},{"id":4,"ship_strategy":"HASH","side":"second"}]},{"id":8,"type":"Sink: OutputFilteredData","pact":"Data Sink","contents":"Sink: OutputFilteredData","parallelism":4,"predecessors":[{"id":7,"ship_strategy":"FORWARD","side":"second"}]}]}

(can be visualized here: https://flink.apache.org/visualizer/)

I have a stream of aggregated data ("AggregatedData", ID = 1) which needs to be filtered by some filter coming from another stream ("FilterInformation", ID = 3).

What I first did was use operator state in my "ConnectAndApplyFilterFunction" (ID = 7), which technically works fine but is limited to a parallelism of 1.

Currently, I'm using a hack: in "AddVirtualKeyFunction" I map my aggregated data to a Tuple2<Integer, AggregatedData> where the Integer (f0) is a randomly generated number from 0 to 19:

@Override
public Tuple2<Integer, AggregatedData> map(AggregatedData value) throws Exception {
    // attach a random "virtual key" in [0, virtualKeyCount), i.e. 0-19, to every record
    return new Tuple2<>(ThreadLocalRandom.current().nextInt(this.virtualKeyCount), value);
}

The "BroadcastFilterInformation" is a flatMap which publishes a Tuple2<Integer, FilterInfo> 20 Times (with f0 0-19) every time it receives a new FilterInformation: “BroadcastFilterInformation”是一个 flatMap,它在每次收到新的 FilterInformation 时发布Tuple2<Integer, FilterInfo> 20 次(f0 0-19):

@Override
public void flatMap(FilterInfo filterInfo, Collector<Tuple2<Integer, FilterInfo>> collector) throws Exception {
    // only act on filters that are newer than the one already forwarded
    if (this.currentLatestTimestamp < filterInfo.getLastUpdateTime()) {
        this.currentLatestTimestamp = filterInfo.getLastUpdateTime();

        // emit one copy per virtual key (broadcastCount = 20) so every key sees the filter
        for (int i = 0; i < this.broadcastCount; i++) {
            collector.collect(new Tuple2<>(i, filterInfo));
        }
    }
}

I now connect both streams and key them by their "virtual key" (Tuple2.f0). I keep 20 copies of my FilterInfo in keyed state in "ConnectAndApplyFilterFunction" (ID = 7).
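For illustration, the wiring could look roughly like the sketch below. This is only a sketch: the variables aggregatedStream and filterStream are placeholder names, and ConnectAndApplyFilterFunction is assumed to be a CoFlatMapFunction, as suggested by the execution plan above.

// sketch of the "virtual key" workaround (variable names are hypothetical)
DataStream<Tuple2<Integer, AggregatedData>> keyedData =
        aggregatedStream.map(new AddVirtualKeyFunction());          // attaches a random key 0-19

DataStream<Tuple2<Integer, FilterInfo>> replicatedFilter =
        filterStream.flatMap(new BroadcastFilterInformation());     // emits 20 keyed copies per new filter

keyedData
        .connect(replicatedFilter)
        .keyBy(0, 0)                                                 // key both inputs by the virtual key (f0)
        .flatMap(new ConnectAndApplyFilterFunction())                // keeps one FilterInfo copy per key in keyed state
        .addSink(new OutputFilteredData());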

This works fine, and I can use parallelism on my main path. But why do I use 20 "virtual keys" while my parallelism is only 4? Because with only 4 keys, not all operator instances will be used (2 instances were not receiving any data in my test).

Is there any way I can broadcast some data from one stream so that every operator instance on the other side receives its own copy?

You can most probably use the broadcast option to make the data available to all parallel instances of an operation.

In the case of batch processing, you can make use of broadcast variables, which the linked website describes as follows (a corresponding example can also be found there):

Broadcast variables allow you to make a data set available to all parallel instances of an operation, in addition to the regular input of the operation. This is useful for auxiliary data sets, or data-dependent parameterization. The data set will then be accessible at the operator as a Collection.
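As a rough sketch of that batch pattern (the data-set names, the "filters" broadcast name, and the filtering logic are illustrative assumptions, not taken from the question):

// Batch (DataSet API): ship the filter set to every parallel instance of the map operator
DataSet<AggregatedData> filtered = aggregatedData
        .map(new RichMapFunction<AggregatedData, AggregatedData>() {
            private List<FilterInfo> filters;

            @Override
            public void open(Configuration parameters) {
                // each parallel instance gets its own copy of the broadcast data set
                this.filters = getRuntimeContext().getBroadcastVariable("filters");
            }

            @Override
            public AggregatedData map(AggregatedData value) {
                // apply "filters" to "value" here (filter logic omitted)
                return value;
            }
        })
        .withBroadcastSet(filterInfos, "filters");                   // filterInfos is a DataSet<FilterInfo>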

In the case of stream processing, you can use datastream.broadcast() to broadcast one stream's elements to every parallel instance of the downstream operator.

According to the Flink website, the broadcast function "Broadcasts elements (from one stream) to every partition".
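Applied to the question, that could look roughly like the following sketch. It assumes the placeholder streams aggregatedStream / filterStream and a hypothetical matches() helper; the plain instance field is per-instance operator state and is not checkpointed here.

// Streaming: broadcast the filter stream so every parallel instance (parallelism 4)
// of the connected operator receives every FilterInfo, so no virtual keys are needed
aggregatedStream
        .connect(filterStream.broadcast())
        .flatMap(new CoFlatMapFunction<AggregatedData, FilterInfo, AggregatedData>() {
            private FilterInfo currentFilter;                        // one local copy per parallel instance

            @Override
            public void flatMap1(AggregatedData value, Collector<AggregatedData> out) {
                // data path: forward records that pass the latest filter (matches() is hypothetical)
                if (currentFilter == null || matches(value, currentFilter)) {
                    out.collect(value);
                }
            }

            @Override
            public void flatMap2(FilterInfo filter, Collector<AggregatedData> out) {
                // broadcast path: every instance updates its own copy of the filter
                currentFilter = filter;
            }
        })
        .addSink(new OutputFilteredData());

With .broadcast(), each FilterInfo is replicated to all parallel instances of the connected operator, so the virtual-key replication from the question is no longer necessary.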

In the stream processing scenario, keep in mind that you need to consider race conditions, as data from either stream can arrive in any order.

Sample code can be checked out here.
