Single source, multiple sinks vs. flatMap

I'm using Kinesis Data Analytics on Flink to do stream processing.
The use case I'm working on is to read records from a single Kinesis stream and, after some transformations, write to multiple S3 buckets. One source record might end up in multiple S3 buckets. We need to write to multiple buckets because each source record contains a lot of information that has to be split across them.

I tried achieving this using multiple sinks.

private static <T> SinkFunction<T> createS3SinkFromStaticConfig(String path, Class<T> type) {
    // Part files are written with a ".snappy.parquet" suffix.
    OutputFileConfig config = OutputFileConfig
            .builder()
            .withPartSuffix(".snappy.parquet")
            .build();

    // One bulk-format (Parquet) sink per output path under the shared s3SinkPath prefix.
    return StreamingFileSink
            .forBulkFormat(new Path(s3SinkPath + "/" + path), createParquetWriter(type))
            .withBucketAssigner(new S3BucketAssigner<T>())
            .withOutputFileConfig(config)
            .withRollingPolicy(new RollingPolicy<T>(DEFAULT_MAX_PART_SIZE, DEFAULT_ROLLOVER_INTERVAL))
            .build();
}

public static void main(String[] args) throws Exception {
    // Parse each JSON record from the Kinesis source once.
    DataStream<PIData> input = createSourceFromStaticConfig(env)
        .map(new JsonToSourceDataMap())
        .name("jsonToInputDataTransformation");

    // Branch 1: write the parsed records unchanged.
    input.map(value -> value)
        .name("rawData")
        .addSink(createS3SinkFromStaticConfig("raw_data", InputData.class))
        .name("s3Sink");

    // Remaining branches: each converts the same input and writes to its own path.
    input.map(FirstConverter::convertInputData)
        .addSink(createS3SinkFromStaticConfig("firstOutput", Output1.class));

    input.map(SecondConverter::convertInputData)
        .addSink(createS3SinkFromStaticConfig("secondOutput", Output2.class));

    input.map(ThirdConverter::convertInputData)
        .addSink(createS3SinkFromStaticConfig("thirdOutput", Output3.class));

    // ...and so on; there are around 10 buckets.
}

However, this had a big performance impact: CPU usage spiked compared to the same job with just one sink. The scale I'm looking at is around 100k records per second.

Other notes: I'm using the bulk format writer since I want to write files in Parquet format. I tried increasing the checkpointing interval from 1 minute to 3 minutes, assuming that writing files to S3 every minute might be causing issues, but this didn't help much.

As I'm new to Flink and stream processing, I'm not sure whether this much performance impact is expected or whether there is something I can do better. Would using a flatMap operator and then having a single sink be better?

When you had a very simple pipeline with a single source and a single sink, something like this:

source -> map -> sink

then the Flink scheduler was able to optimize the execution, and the entire pipeline ran as a sequence of function calls within a single task, with no serialization or network overhead. Flink 1.12 can apply this operator chaining optimization to more complex topologies, perhaps including the one you have now with multiple sinks, but I don't believe this was possible with Flink 1.11 (which is what KDA is currently based on).
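To make the difference concrete, here is a minimal plain-Java sketch (no Flink dependency) of what chaining saves. The `Event` type and the two stand-in operators are hypothetical, and `java.io` serialization is used only as a stand-in: Flink uses its own serializers, so the real cost will differ, but the shape of the extra work is the same.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class ChainingSketch {

    // Hypothetical record type standing in for a parsed source record.
    public static class Event implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String id;
        public final int value;

        public Event(String id, int value) {
            this.id = id;
            this.value = value;
        }
    }

    // Two stand-in "operators": a parse step and a convert step.
    static final Function<Event, Event> PARSE = e -> new Event(e.id.trim(), e.value);
    static final Function<Event, Event> CONVERT = e -> new Event(e.id, e.value * 2);

    // Chained: the second operator is called directly with the first one's output object.
    public static Event chained(Event in) {
        return CONVERT.apply(PARSE.apply(in));
    }

    // Unchained: the record is turned into bytes and back before the next operator sees it.
    public static Event unchained(Event in) {
        return CONVERT.apply(roundTrip(PARSE.apply(in)));
    }

    // Serialize and deserialize one record (java.io here; Flink uses its own serializers).
    static Event roundTrip(Event e) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(e);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Event) in.readObject();
            }
        } catch (IOException | ClassNotFoundException ex) {
            throw new IllegalStateException("round trip failed", ex);
        }
    }

    public static void main(String[] args) {
        Event in = new Event(" a ", 21);
        Event c = chained(in);
        Event u = unchained(in);
        System.out.println(c.id + "=" + c.value + " / " + u.id + "=" + u.value);
    }
}
```

Both paths produce the same result; the unchained one just does an extra encode and decode per record. If the ten branches in your job are not chained to the upstream map, that extra work is paid roughly once per branch for every record, which at 100k records per second would be consistent with the CPU spike you observed.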

I don't see how using a flatMap would make any difference.

You can probably optimize your serialization/deserialization. See https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html.
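One point from that post is that Flink serializes types it recognizes as POJOs much more cheaply than types that fall back to Kryo. As a rough illustration, the sketch below checks a class against a simplified version of the POJO rules (public class, public no-argument constructor, public non-final fields). It is an approximation written for this answer, not Flink's own type analysis: it ignores the getter/setter alternative, and `SampleOutput`, `NotAPojo`, and `looksLikePojo` are hypothetical names.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

public class PojoShapeSketch {

    // Hypothetical output record shaped the way Flink's POJO rules expect.
    public static class SampleOutput {
        public String deviceId;
        public long timestamp;
        public double reading;

        public SampleOutput() {}
    }

    // Hypothetical record that breaks the rules: no no-arg constructor, private final field.
    public static class NotAPojo {
        private final String deviceId;

        public NotAPojo(String deviceId) {
            this.deviceId = deviceId;
        }
    }

    // Simplified check; real Flink also accepts private fields with public getters and setters.
    public static boolean looksLikePojo(Class<?> type) {
        if (!Modifier.isPublic(type.getModifiers())) {
            return false;
        }
        try {
            type.getConstructor(); // throws if there is no public no-arg constructor
        } catch (NoSuchMethodException e) {
            return false;
        }
        for (Field f : type.getDeclaredFields()) {
            int m = f.getModifiers();
            if (f.isSynthetic() || Modifier.isStatic(m) || Modifier.isTransient(m)) {
                continue; // these fields are not part of the serialized record
            }
            if (!Modifier.isPublic(m) || Modifier.isFinal(m)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("SampleOutput: " + looksLikePojo(SampleOutput.class));
        System.out.println("NotAPojo: " + looksLikePojo(NotAPojo.class));
    }
}
```

In an actual job, calling `env.getConfig().disableGenericTypes()` is a more reliable check: the job then fails at startup if any type would go through Kryo, so you can see which of your input and output classes need attention.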
