
Compute metrics on different windows in Apache Flink

I am using Apache Flink 1.2 and here's my question: I have a stream of data and I would like to compute a metric over a window of 1 day. Therefore I will write something like:

DataStream<Tuple6<Timestamp, String, Double, Double, Double, Integer>> myStream0 = 
            env.readTextFile("Myfile.csv")
            .map(new MyMapper())                // Parse the input
            .assignTimestampsAndWatermarks(new MyExtractor())   //Assign the timestamp of the event 
            .timeWindowAll(Time.days(1))    
            .apply(new Average());  // compute average, max, sum over the window
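`MyMapper` is not shown in the question; a hypothetical version of its parsing step for one CSV line might look like the following (the column order and timestamp format of `Myfile.csv` are assumptions):

```java
import java.sql.Timestamp;

public class MyMapperSketch {
    // Hypothetical parse of one CSV line into the six tuple fields;
    // the actual layout of Myfile.csv is not shown in the question.
    static Object[] parse(String line) {
        String[] f = line.split(",");
        return new Object[] {
            Timestamp.valueOf(f[0]),   // event timestamp (yyyy-MM-dd HH:mm:ss assumed)
            f[1],                      // String key
            Double.parseDouble(f[2]),
            Double.parseDouble(f[3]),
            Double.parseDouble(f[4]),
            Integer.parseInt(f[5])
        };
    }

    public static void main(String[] args) {
        Object[] t = parse("2017-01-01 00:00:00,sensorA,1.5,2.5,3.5,7");
        System.out.println(t[1] + " " + t[2] + " " + t[5]); // prints: sensorA 1.5 7
    }
}
```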

Now I would like to compute the same metrics over a window of 1 hour.

I can write the same as before and specify Time.hours(1), but my concern is that this way Apache Flink reads the input file twice and does twice the work. I wonder if there is a way of doing it all together (i.e. using the same stream).

You can compute the hourly aggregates and, from those, the daily aggregates. For a simple DataStream<Double> this would look as follows:

DataStream<Double> vals = ... // source + timestamp extractor

DataStream<Tuple2<Double, Long>> valCnt = vals // (sum, cnt)
  .map(new CntAppender())     // Double -> Tuple2<Double, Long(1)>

DataStream<Tuple3<Double, Long, Long>> hourlySumCnt = valCnt // (sum, cnt, endTime)
  .timeWindowAll(Time.hours(1))
  // SumCounter ReduceFunction sums the Double and Long field (Long is Count)
  // WindowEndAppender WindowFunction adds the window end timestamp (3rd field)
  .reduce(new SumCounter(), new WindowEndAppender())   

DataStream<Tuple2<Double, Long>> hourlyAvg = hourlySumCnt // (avg, endTime)
  .map(new SumDivCnt()) // MapFunction divides Sum by Cnt for average

DataStream<Tuple3<Double, Long, Long>> dailySumCnt = hourlySumCnt // (sum, cnt, endTime)
  .map(new StripeOffTime()) // removes unnecessary time field -> Tuple2<Double, Long>
  .timeWindowAll(Time.days(1))
  .reduce(new SumCounter(), new WindowEndAppender()) // same as above

DataStream<Tuple2<Double, Long>> dailyAvg = dailySumCnt // (avg, endTime)
  .map(new SumDivCnt()) // same as above
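The reason this works is that (sum, count) pairs compose: merging the hourly accumulators field-wise yields exactly the daily accumulator, so the daily average can be derived without reprocessing the raw stream. A minimal plain-Java sketch of that merge logic (the class and method names below are illustrative, not Flink API):

```java
import java.util.Arrays;
import java.util.List;

public class SumCntMerge {
    // (sum, cnt) accumulator, the same shape as the Tuple2<Double, Long> above
    static final class SumCnt {
        final double sum;
        final long cnt;
        SumCnt(double sum, long cnt) { this.sum = sum; this.cnt = cnt; }
        // what a SumCounter-style ReduceFunction does: field-wise addition
        SumCnt merge(SumCnt other) { return new SumCnt(sum + other.sum, cnt + other.cnt); }
        // what a SumDivCnt-style MapFunction does: finalize the average
        double avg() { return sum / cnt; }
    }

    public static void main(String[] args) {
        // pretend these are three 1-hour window results
        List<SumCnt> hourly = Arrays.asList(
            new SumCnt(10.0, 4), new SumCnt(6.0, 2), new SumCnt(8.0, 2));
        // merging the hourly accumulators gives the daily accumulator
        SumCnt daily = hourly.stream().reduce(new SumCnt(0.0, 0), SumCnt::merge);
        System.out.println(daily.sum + " " + daily.cnt + " " + daily.avg());
        // prints: 24.0 8 3.0
    }
}
```

Note that this only works because sum and count are both associative; averaging hourly averages directly would weight every hour equally, regardless of how many records each hour contained.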

So, you basically compute sum and count for each hour, and based on that result you

  1. compute the hourly average
  2. compute the daily sum and count, and from those the daily average

Note that I am using a ReduceFunction instead of a WindowFunction for the sum and count computation, because a ReduceFunction is eagerly applied, i.e., the records of the window are not collected but immediately aggregated. Hence the state that needs to be maintained is a single record.
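The difference in state size can be sketched in plain Java (illustrative only; the real operators run inside Flink's windowing runtime):

```java
import java.util.ArrayList;
import java.util.List;

public class EagerReduceSketch {
    // ReduceFunction style: fold each record into a single accumulator on
    // arrival, so the operator keeps one record of state per window.
    static double eagerSum(double[] windowRecords) {
        double acc = 0.0;                                 // O(1) state
        for (double r : windowRecords) acc += r;
        return acc;
    }

    // WindowFunction style: buffer every record and aggregate only when the
    // window fires, so state grows linearly with the window contents.
    static double bufferedSum(double[] windowRecords) {
        List<Double> buffered = new ArrayList<>();
        for (double r : windowRecords) buffered.add(r);   // O(n) state
        return buffered.stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        double[] records = {1.0, 2.0, 3.0, 4.0};
        // both produce the same result; only the state footprint differs
        System.out.println(eagerSum(records) + " " + bufferedSum(records));
        // prints: 10.0 10.0
    }
}
```

For a 1-day window over a high-volume stream, that difference between holding one accumulator and holding a full day of records is substantial.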
