
Why is Apache Flink dropping events from the datastream?

In the following unit test case, a number of events specified by numberOfElements is generated and fed in as a data stream. The test randomly fails at the line:

assertEquals(numberOfElements, CollectSink.values.size());

Is there any explanation for why Apache Flink is skipping events?

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.junit.Before;
import org.junit.Test;

import java.util.ArrayList;
import java.util.List;

import static java.lang.Thread.sleep;
import static org.junit.Assert.assertEquals;

public class FlinkTest {

    StreamExecutionEnvironment env;

    @Before
    public void setup() {
        env = StreamExecutionEnvironment.createLocalEnvironment();
    }

    @Test
    public void testStream1() throws Exception {
        testStream();
    }

    @Test
    public void testStream2() throws Exception {
        testStream();
    }

    @Test
    public void testStream3() throws Exception {
        testStream();
    }

    @Test
    public void testStream4() throws Exception {
        testStream();
    }

    @Test
    public void testStream() throws Exception {

        final int numberOfElements = 50;

        DataStream<Tuple2<String, Integer>> tupleStream = env.fromCollection(getCollectionOfBucketImps(numberOfElements));
        CollectSink.values.clear();
        tupleStream.addSink(new CollectSink());
        env.execute();
        sleep(2000);

        assertEquals(numberOfElements, getCollectionOfBucketImps(numberOfElements).size());
        assertEquals(numberOfElements, CollectSink.values.size());
    }

    public static List<Tuple2<String, Integer>> getCollectionOfBucketImps(int numberOfElements) throws InterruptedException {
        List<Tuple2<String, Integer>> records = new ArrayList<>();
        for (int i = 0; i < numberOfElements; i++) {
            records.add(new Tuple2<>(Integer.toString(i % 10), i));
        }
        return records;
    }

    // create a testing sink
    private static class CollectSink implements SinkFunction<Tuple2<String, Integer>> {

        public static final List<Tuple2<String, Integer>> values = new ArrayList<>();

        @Override
        public synchronized void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
            values.add(value);
        }
    }
}

For example, any of the testStreamX cases fails randomly.

Context: the code runs with a parallelism setting of 8, since the CPU it runs on has 8 cores.
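A hedged aside: StreamExecutionEnvironment.createLocalEnvironment() defaults its parallelism to the number of available cores, which is why the job runs with 8 parallel subtasks here. A minimal sketch to verify this (the class name ParallelismCheck is illustrative only):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismCheck {
    public static void main(String[] args) {
        // createLocalEnvironment() picks up the number of available cores
        // as its default parallelism, so every operator (including the sink)
        // runs with that many parallel subtasks unless overridden
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        System.out.println(env.getParallelism()); // prints 8 on an 8-core CPU
    }
}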

I don't know the parallelism of your jobs (I suppose it is the maximum that Flink can assign). It looks like you have a race condition on the adds to your sink's values list.
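To make the race concrete, here is a minimal sketch with plain Java threads (no Flink; the class and names are illustrative). Each writer synchronizes on its own instance, just as each parallel CollectSink subtask locks its own monitor, so the shared ArrayList is mutated without mutual exclusion and adds can be lost:

import java.util.ArrayList;
import java.util.List;

public class RaceSketch {
    static final List<Integer> values = new ArrayList<>();

    static class Writer extends Thread {
        // synchronized instance method: each Writer locks its own monitor,
        // so the lock does NOT serialize access to the shared list
        synchronized void add(int v) { values.add(v); }

        @Override
        public void run() {
            for (int i = 0; i < 10_000; i++) add(i);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Writer a = new Writer();
        Writer b = new Writer();
        a.start(); b.start();
        a.join(); b.join();
        // typically prints less than 20000 (and can even throw during an
        // internal resize): unsynchronized ArrayList.add loses updates
        System.out.println(values.size());
    }
}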

Solution

I have run your example code, setting the environment parallelism to 1, and everything works fine. The documentation's examples about testing use this solution (see the Flink testing documentation).

@Before
public void setup() {
    env = StreamExecutionEnvironment.createLocalEnvironment();
    env.setParallelism(1);
}

Even better

You can set the parallelism to 1 on the sink operator only and maintain the parallelism of the rest of the pipeline. In the following example, I added an extra map function with a forced parallelism of 8 for the map operator.

import org.apache.flink.api.common.functions.MapFunction;

public void testStream() throws Exception {

    final int numberOfElements = 50;

    DataStream<Tuple2<String, Integer>> tupleStream = env.fromCollection(getCollectionOfBucketImps(numberOfElements));
    CollectSink.values.clear();
    tupleStream
            .map(new MapFunction<Tuple2<String,Integer>, Tuple2<String,Integer>>() {
                @Override
                public Tuple2<String,Integer> map(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {

                    stringIntegerTuple2.f0 += "- concat something";

                    return stringIntegerTuple2;
                }
            }).setParallelism(8)
            .addSink(new CollectSink()).setParallelism(1);
    env.execute();
    sleep(2000);

    assertEquals(numberOfElements, getCollectionOfBucketImps(numberOfElements).size());
    assertEquals(numberOfElements, CollectSink.values.size());
}

When the parallelism of the environment is greater than 1, there are multiple instances of CollectSink, which can cause a race condition.

These are solutions to avoid the race condition:

  1. Synchronize on the class object
private static class CollectSink implements SinkFunction<Tuple2<String, Integer>> {

    public static final List<Tuple2<String, Integer>> values = new ArrayList<>();

    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        synchronized(CollectSink.class) {
            values.add(value);
        }
    }
 }
  2. Collections.synchronizedList()
import java.util.Collections;
private static class CollectSink implements SinkFunction<Tuple2<String, Integer>> {

    public static final List<Tuple2<String, Integer>> values = Collections.synchronizedList(new ArrayList<>());

    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        values.add(value);
    }
 }
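Another option, offered as a suggestion rather than something taken from the Flink docs, is to drop explicit locking entirely and use a thread-safe collection from java.util.concurrent:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

private static class CollectSink implements SinkFunction<Tuple2<String, Integer>> {

    // a lock-free, thread-safe queue: concurrent adds from parallel
    // subtasks cannot be lost
    public static final Queue<Tuple2<String, Integer>> values = new ConcurrentLinkedQueue<>();

    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        values.add(value);
    }
}

The test assertions still work unchanged, since Queue exposes size() and clear() just like List.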
