
Apache Flink aggregation of transactions

I've been trying to figure out how to write a Flink program that receives events from 3 Kafka topics and sums them up for today, yesterday, and the day before yesterday.

So the first question is: how can I sum the transactions for 3 different days and extract them as a JSON file?

If you want to read from 3 different Kafka topics or partitions, you have to create 3 Kafka sources.

See Flink's documentation about the Kafka consumer.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08

val env = StreamExecutionEnvironment.getExecutionEnvironment
val consumer0 = new FlinkKafkaConsumer08[String](...) // topic, schema and properties elided
val consumer1 = new FlinkKafkaConsumer08[String](...)
val consumer2 = new FlinkKafkaConsumer08[String](...)
consumer0.setStartFromGroupOffsets()
consumer1.setStartFromGroupOffsets()
consumer2.setStartFromGroupOffsets()

val stream0 = env.addSource(consumer0)
val stream1 = env.addSource(consumer1)
val stream2 = env.addSource(consumer2)

val unitedStream = stream0.union(stream1, stream2)

/* Logic to group transactions from 3 days */
/* I'd need more info, but it should be a sliding or tumbling window keyed by the transaction id */
/* note: timeWindow uses processing time unless the environment is configured for event time */

val windowSize = 1 // number of days the window uses to group events
val windowStep = 1 // the window slides by 1 day

val reducedStream = unitedStream
    .map(transaction => {
        // set the count to 1 per event *before* windowing; a windowed
        // stream has no map, only aggregations like sum/reduce/apply
        transaction.numberOfTransactions = 1
        transaction
    })
    .keyBy("transactionId") // or any field that groups transactions in the same group
    .timeWindow(Time.days(windowSize), Time.days(windowStep))
    .sum("numberOfTransactions")

val streamFormattedAsJson = reducedStream.map(functionToParseDataAsJson)
// you can use a library like GSON for this
// or a Scala string template

streamFormattedAsJson.addSink(yourFavoriteSinkToWriteYourData)
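
For illustration, functionToParseDataAsJson could be as simple as a Scala string template. The Transaction case class below is an assumption standing in for your real event type:

// hypothetical event type, included only to make the sketch self-contained
case class Transaction(transactionId: String, var numberOfTransactions: Long)

// minimal sketch: build the JSON line by hand with a string template
def functionToParseDataAsJson(transaction: Transaction): String =
  s"""{"transactionId": "${transaction.transactionId}", "numberOfTransactions": ${transaction.numberOfTransactions}}"""

For anything beyond a flat record like this, a library such as GSON or Jackson is safer than hand-built strings.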

If your topic names can be matched with a regular expression, you can create just one Kafka consumer, as follows:

val env = StreamExecutionEnvironment.getExecutionEnvironment

val consumer = new FlinkKafkaConsumer08[String](
  java.util.regex.Pattern.compile("day-[1-3]"),
  ..., // check the documentation to know how to fill this field (deserialization schema)
  ...) // check the documentation to know how to fill this field (consumer properties)

val stream = env.addSource(consumer)
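
For example, the two elided fields could be filled with a plain string deserializer and the consumer properties. The values below are assumptions for a local Kafka 0.8 setup (the SimpleStringSchema package path may differ across Flink versions):

import org.apache.flink.streaming.util.serialization.SimpleStringSchema

// assumed connection properties; the 0.8 consumer also needs ZooKeeper
val properties = new java.util.Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2181")
properties.setProperty("group.id", "flink-transactions") // group id is an assumption

val consumer = new FlinkKafkaConsumer08[String](
  java.util.regex.Pattern.compile("day-[1-3]"),
  new SimpleStringSchema(), // one String per Kafka record
  properties)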

The most common approach is to have all transactions in the same Kafka topic rather than in different topics. In that case the code is simpler, because you only need a single window to process your data:

Day 1 -> 11111 -\
Day 2 -> 22222 --> 11111222223333 -> Window -> 11111 22222 3333 -> reduce operation per window partition
Day 3 -> 3333 --/                              |-----|-----|----|

Example code:

val env = StreamExecutionEnvironment.getExecutionEnvironment
val consumer = new FlinkKafkaConsumer08[String](...)
consumer.setStartFromGroupOffsets()

val stream = env.addSource(consumer)

/* Logic to group transactions from 3 days */
/* I'd need more info, but it should be a sliding or tumbling window keyed by the transaction id */

val windowSize = 1 // number of days the window uses to group events
val windowStep = 1 // the window slides by 1 day

val reducedStream = stream
    .map(transaction => {
        transaction.numberOfTransactions = 1 // count 1 per event, before windowing
        transaction
    })
    .keyBy("transactionId") // or any field that groups transactions in the same group
    .timeWindow(Time.days(windowSize), Time.days(windowStep))
    .sum("numberOfTransactions")

val streamFormattedAsJson = reducedStream.map(functionToParseDataAsJson)
// you can use a library like GSON for this
// or a Scala string template

streamFormattedAsJson.addSink(yourFavoriteSinkToWriteYourData)
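
Two details the sketch leaves implicit: picking a concrete sink and launching the job. As a minimal ending for a quick test (the output path is a placeholder), you could write the JSON lines to text files and then start the pipeline:

// quick-and-dirty sink for testing; use a real sink in production
streamFormattedAsJson.writeAsText("/tmp/transactions-per-day")

// nothing runs until the job is actually submitted
env.execute("transaction aggregation over 3 days")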
