Why is Apache Flink dropping events from the datastream?
In the following unit test, a number of events given by numberOfElements is generated and fed in as a data stream. The test fails randomly on this line:

assertEquals(numberOfElements, CollectSink.values.size());

Is there any explanation for why Apache Flink drops events?
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.junit.Before;
import org.junit.Test;

import java.util.ArrayList;
import java.util.List;

import static java.lang.Thread.sleep;
import static org.junit.Assert.assertEquals;

public class FlinkTest {

    StreamExecutionEnvironment env;

    @Before
    public void setup() {
        env = StreamExecutionEnvironment.createLocalEnvironment();
    }

    @Test
    public void testStream1() throws Exception {
        testStream();
    }

    @Test
    public void testStream2() throws Exception {
        testStream();
    }

    @Test
    public void testStream3() throws Exception {
        testStream();
    }

    @Test
    public void testStream4() throws Exception {
        testStream();
    }

    @Test
    public void testStream() throws Exception {
        final int numberOfElements = 50;
        DataStream<Tuple2<String, Integer>> tupleStream =
                env.fromCollection(getCollectionOfBucketImps(numberOfElements));
        CollectSink.values.clear();
        tupleStream.addSink(new CollectSink());
        env.execute();
        sleep(2000);
        assertEquals(numberOfElements, getCollectionOfBucketImps(numberOfElements).size());
        assertEquals(numberOfElements, CollectSink.values.size());
    }

    public static List<Tuple2<String, Integer>> getCollectionOfBucketImps(int numberOfElements) throws InterruptedException {
        List<Tuple2<String, Integer>> records = new ArrayList<>();
        for (int i = 0; i < numberOfElements; i++) {
            records.add(new Tuple2<>(Integer.toString(i % 10), i));
        }
        return records;
    }

    // a testing sink that collects results into a shared static list
    private static class CollectSink implements SinkFunction<Tuple2<String, Integer>> {

        public static final List<Tuple2<String, Integer>> values = new ArrayList<>();

        @Override
        public synchronized void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
            values.add(value);
        }
    }
}
For example, one of the testStreamX cases fails at random.

Context: the code runs with a parallelism of 8, because the CPU running it has 8 cores.

I don't know whether your job behaves differently (I suppose 8 is the maximum Flink can allocate here), but it looks like you have a race condition on the values added in the sink.
Solution

I ran your example code with the environment parallelism set to 1 and everything works. The documentation example about testing uses this solution (see the Flink testing documentation).
@Before
public void setup() {
    env = StreamExecutionEnvironment.createLocalEnvironment();
    env.setParallelism(1);
}
Better

You can set the parallelism to 1 only on the sink operator and keep the parallelism of the rest of the pipeline. In the following example I added an extra map function, forcing a parallelism of 8 for the map operator.
public void testStream() throws Exception {
    final int numberOfElements = 50;
    DataStream<Tuple2<String, Integer>> tupleStream =
            env.fromCollection(getCollectionOfBucketImps(numberOfElements));
    CollectSink.values.clear();
    tupleStream
            // requires: import org.apache.flink.api.common.functions.MapFunction;
            .map(new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
                    stringIntegerTuple2.f0 += "- concat something";
                    return stringIntegerTuple2;
                }
            }).setParallelism(8)
            .addSink(new CollectSink()).setParallelism(1);
    env.execute();
    sleep(2000);
    assertEquals(numberOfElements, getCollectionOfBucketImps(numberOfElements).size());
    assertEquals(numberOfElements, CollectSink.values.size());
}
When the environment's parallelism is greater than 1, there will be multiple instances of CollectSink, which can lead to a race condition. Here are solutions to avoid it:
private static class CollectSink implements SinkFunction<Tuple2<String, Integer>> {

    public static final List<Tuple2<String, Integer>> values = new ArrayList<>();

    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        synchronized (CollectSink.class) {
            values.add(value);
        }
    }
}
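Locking on the class object gives every sink instance one shared lock, unlike the original `synchronized` instance method, which locks each instance separately and so protects nothing when each parallel subtask has its own instance. A minimal plain-Java sketch of this (no Flink required; the class and thread counts are illustrative, not from the original code):

```java
import java.util.ArrayList;
import java.util.List;

public class SinkRaceDemo {
    // Shared static list, mirroring CollectSink.values.
    static final List<Integer> values = new ArrayList<>();

    // Mimics one parallel sink subtask. synchronized (Sink.class)
    // means all instances contend on a single lock, so concurrent
    // adds to the shared list cannot be lost.
    static class Sink {
        void invoke(int value) {
            synchronized (Sink.class) {
                values.add(value);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int parallelism = 8;
        final int perTask = 10_000;
        Thread[] tasks = new Thread[parallelism];
        for (int p = 0; p < parallelism; p++) {
            final Sink sink = new Sink(); // a distinct instance per "subtask"
            tasks[p] = new Thread(() -> {
                for (int i = 0; i < perTask; i++) sink.invoke(i);
            });
            tasks[p].start();
        }
        for (Thread t : tasks) t.join();
        // With the shared class-level lock, no additions are lost:
        System.out.println(values.size()); // 80000
    }
}
```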
Or use Collections.synchronizedList():
import java.util.Collections;

private static class CollectSink implements SinkFunction<Tuple2<String, Integer>> {

    public static final List<Tuple2<String, Integer>> values = Collections.synchronizedList(new ArrayList<>());

    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        values.add(value);
    }
}
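Collections.synchronizedList makes each individual add() atomic, so the parallel subtasks can share the list without any explicit locking in invoke(). A standalone sketch of that behavior (again plain Java, thread and element counts illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SynchronizedListDemo {
    public static void main(String[] args) throws InterruptedException {
        // The wrapper synchronizes every operation, so concurrent
        // add() calls from many threads cannot lose elements.
        List<Integer> values = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int p = 0; p < 8; p++) {
            pool.submit(() -> {
                for (int i = 0; i < 10_000; i++) values.add(i);
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println(values.size()); // 80000
    }
}
```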