Single source, multiple sinks vs. flatMap
I'm using Kinesis Data Analytics on Flink to do stream processing.
The use case I'm working on is to read records from a single Kinesis stream and, after some transformations, write them to multiple S3 buckets. One source record might end up in multiple S3 buckets. We need to write to multiple buckets because each source record contains a lot of information that has to be split across multiple S3 buckets.
I tried achieving this using multiple sinks:
private static <T> SinkFunction<T> createS3SinkFromStaticConfig(String path, Class<T> type) {
    OutputFileConfig config = OutputFileConfig
            .builder()
            .withPartSuffix(".snappy.parquet")
            .build();
    final StreamingFileSink<T> sink = StreamingFileSink
            .forBulkFormat(new Path(s3SinkPath + "/" + path), createParquetWriter(type))
            .withBucketAssigner(new S3BucketAssigner<T>())
            .withOutputFileConfig(config)
            .withRollingPolicy(new RollingPolicy<T>(DEFAULT_MAX_PART_SIZE, DEFAULT_ROLLOVER_INTERVAL))
            .build();
    return sink;
}
public static void main(String[] args) throws Exception {
    DataStream<PIData> input = createSourceFromStaticConfig(env)
            .map(new JsonToSourceDataMap())
            .name("jsonToInputDataTransformation");

    input.map(value -> value)
            .name("rawData")
            .addSink(createS3SinkFromStaticConfig("raw_data", InputData.class))
            .name("s3Sink");
    input.map(FirstConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("firstOutput", Output1.class));
    input.map(SecondConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("secondOutput", Output2.class));
    input.map(ThirdConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("thirdOutput", Output3.class));
    // and so on; there are around 10 buckets.
}
However, this had a big performance impact: I saw a large CPU spike compared to a job with just one sink. The scale I'm targeting is around 100k records per second.
Other notes: I'm using a bulk-format writer since I want to write the files in Parquet format. I tried increasing the checkpointing interval from 1 minute to 3 minutes, assuming that writing files to S3 every minute might be causing issues, but this didn't help much.
As I'm new to Flink and stream processing, I'm not sure if this much performance impact is expected, or whether there is something I can do better. Would using a flatMap operator and then having a single sink be better?
When you have a very simple pipeline with a single source and a single sink, something like this:
source -> map -> sink
then the Flink scheduler is able to optimize the execution, and the entire pipeline runs as a sequence of function calls within a single task, with no serialization or network overhead. Flink 1.12 can apply this operator chaining optimization to more complex topologies, perhaps including the one you have now with multiple sinks, but I don't believe this was possible with Flink 1.11 (which is what KDA is currently based on).
I don't see how using a flatMap would make any difference.
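For reference, the idiomatic way to fan one record out to several logically distinct streams from a single operator is a ProcessFunction with side outputs rather than a flatMap. This doesn't remove the per-sink serialization and upload cost, but it does replace the N separate map operators with one. A rough sketch only, reusing the type and converter names from your snippet (the OutputTag names are illustrative, and I've shown just two of the ten buckets):

```java
// Hypothetical tags identifying the per-bucket side-output streams.
final OutputTag<Output1> firstTag = new OutputTag<Output1>("firstOutput") {};
final OutputTag<Output2> secondTag = new OutputTag<Output2>("secondOutput") {};

SingleOutputStreamOperator<InputData> mainStream = input
        .process(new ProcessFunction<InputData, InputData>() {
            @Override
            public void processElement(InputData value, Context ctx, Collector<InputData> out) {
                out.collect(value); // raw data goes to the main output
                ctx.output(firstTag, FirstConverter.convertInputData(value));
                ctx.output(secondTag, SecondConverter.convertInputData(value));
                // ...and so on for the remaining buckets
            }
        });

mainStream.addSink(createS3SinkFromStaticConfig("raw_data", InputData.class));
mainStream.getSideOutput(firstTag)
        .addSink(createS3SinkFromStaticConfig("firstOutput", Output1.class));
mainStream.getSideOutput(secondTag)
        .addSink(createS3SinkFromStaticConfig("secondOutput", Output2.class));
```

With ten sinks the job still has ten sink operators either way, so I wouldn't expect a large CPU win from this restructuring alone.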
You can probably optimize your serialization/deserialization. See https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html.
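One concrete, low-effort check from that article is to make sure your record classes are serialized with Flink's fast POJO serializer rather than silently falling back to Kryo. A sketch of how you might wire that up (class names taken from your snippet; whether they actually qualify as POJOs depends on your real definitions):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Fail fast at job submission if any type would fall back to
// Kryo's slow generic serialization path.
env.getConfig().disableGenericTypes();

// If a type genuinely cannot be made a POJO, registering it with
// Kryo at least avoids writing the full class name per record.
env.getConfig().registerKryoType(Output1.class);
```

With `disableGenericTypes()` enabled, the job fails with an exception pointing at the offending type instead of running slowly, which makes the problematic classes easy to find.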