
Flink - event-time sliding window with missing data in window due to time gaps

Suppose I have a stream of stock market trading events, like this:

technical1, ALXN, 1/1/2016
technical1, CELG, 1/1/2016
technical2, ALXN, 1/2/2016
technical2, CELG, 1/2/2016
. . . 
technicalN, ALXN, 4/1/2018
technicalN, CELG, 4/1/2018

such that technicalN (where N is some number) represents the Nth technical trading entry [Open (float), High (float), Low (float), Close (float), Volume (int)] of end-of-day daily stock market trading data for the given company (i.e. technical1 for ticker GOOG is different from technical1 for ticker MSFT). Like:

12.52, 19.25, 09.11, 17.54, 120532, GOOG, 1/1/2017
14.37, 29.52, 01.53, 12.96, 627156, MSFT, 1/1/2017

(Note that these trading prices/volumes are completely fictitious.)

Let's say that I want to create a window of size 2 with an interval of 1 day so that our data would look something like this:

[technical1, GOOG, 12/26/2017; technical2, GOOG, 12/27/2017]
[technical1, MSFT, 12/26/2017; technical2, MSFT, 12/27/2017]
[technical2, GOOG, 12/27/2017; technical3, GOOG, 12/28/2017]
[technical2, MSFT, 12/27/2017; technical3, MSFT, 12/28/2017]
[technical3, GOOG, 12/28/2017; technical4, GOOG, 12/29/2017]
[technical3, MSFT, 12/28/2017; technical4, MSFT, 12/29/2017]
[technical4, GOOG, 12/29/2017; technical5, GOOG, 12/30/2017]
[technical4, MSFT, 12/29/2017; technical5, MSFT, 12/30/2017]
[technical5, GOOG, 12/30/2017; technical6, GOOG, 12/31/2017]
[technical5, MSFT, 12/30/2017; technical6, MSFT, 12/31/2017]
[technical6, GOOG, 12/31/2017; technical7, GOOG, 01/01/2018]
[technical6, MSFT, 12/31/2017; technical7, MSFT, 01/01/2018]
[technical7, GOOG, 01/01/2018; technical8, GOOG, 01/02/2018]
[technical7, MSFT, 01/01/2018; technical8, MSFT, 01/02/2018]
[technical8, GOOG, 01/02/2018; technical9, GOOG, 01/03/2018]
[technical8, MSFT, 01/02/2018; technical9, MSFT, 01/03/2018]
[. . .]
[technicalN, GOOG, 04/01/2018; technicalN+1, GOOG, 04/02/2018]
[technicalN, MSFT, 04/01/2018; technicalN+1, MSFT, 04/02/2018]
. . .

This would be nice, but it's problematic because stock market trading dates are not continuous. In other words, if I understand the mechanics of Flink correctly (and I could be wrong), the problem with using an event-time sliding window like this:

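import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream<TechnicalDataEntry> input = ...;

// sliding event-time windows, keyed by ticker
input
    .keyBy((TechnicalDataEntry technical) -> technical.ticker)
    .window(SlidingEventTimeWindows.of(Time.days(2), Time.days(1))) // Window size of 2 days, sliding interval of 1 day
    .<windowed transformation>(<window function>);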

on data like that is that the date values are not continuous (they form a discrete series with gaps of one or more missing days), because there is no stock market data for dates on which the stock market is closed, such as holidays or weekends. So, with that in mind, our stream would actually end up looking more like this (because trading is closed on 12/30/2017, 12/31/2017, and 01/01/2018):

[technical1, GOOG, 12/26/2017; technical2, GOOG, 12/27/2017]
[technical1, MSFT, 12/26/2017; technical2, MSFT, 12/27/2017]
[technical2, GOOG, 12/27/2017; technical3, GOOG, 12/28/2017]
[technical2, MSFT, 12/27/2017; technical3, MSFT, 12/28/2017]
[technical3, GOOG, 12/28/2017; technical4, GOOG, 12/29/2017]
[technical3, MSFT, 12/28/2017; technical4, MSFT, 12/29/2017]
[technical4, GOOG, 12/29/2017; NULL]
[technical4, MSFT, 12/29/2017; NULL]
[NULL; NULL]
[NULL; NULL]
[NULL; NULL]
[NULL; NULL]
[NULL; technical8, GOOG, 01/02/2018]
[NULL; technical8, MSFT, 01/02/2018]
[technical8, GOOG, 01/02/2018; technical9, GOOG, 01/03/2018]
[technical8, MSFT, 01/02/2018; technical9, MSFT, 01/03/2018]
[. . .]
[technicalN, GOOG, 04/01/2018; technicalN+1, GOOG, 04/02/2018]
[technicalN, MSFT, 04/01/2018; technicalN+1, MSFT, 04/02/2018]
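
To make the gap concrete, here is a minimal plain-Java sketch (no Flink dependency) of how a 2-day window sliding by 1 day would bucket a few trading dates. It assumes day-aligned windows with zero offset, which is my understanding of what the sliding assigner does by default; the class and method names (`SlidingWindowGaps`, `windowStarts`, `assign`) are made up for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: mimics day-aligned sliding-window assignment, not Flink itself.
class SlidingWindowGaps {
    static final long SIZE_DAYS = 2;   // window size
    static final long SLIDE_DAYS = 1;  // sliding interval

    // Start days (as epoch days) of every window [start, start + SIZE_DAYS) containing epochDay.
    static List<Long> windowStarts(long epochDay) {
        List<Long> starts = new ArrayList<>();
        long lastStart = epochDay - Math.floorMod(epochDay, SLIDE_DAYS);
        for (long start = lastStart; start > epochDay - SIZE_DAYS; start -= SLIDE_DAYS) {
            starts.add(start);
        }
        return starts;
    }

    // Groups trading dates by window, keyed by window start date (sorted).
    static Map<LocalDate, List<LocalDate>> assign(List<LocalDate> tradingDates) {
        Map<LocalDate, List<LocalDate>> windows = new TreeMap<>();
        for (LocalDate date : tradingDates) {
            for (long start : windowStarts(date.toEpochDay())) {
                windows.computeIfAbsent(LocalDate.ofEpochDay(start), k -> new ArrayList<>()).add(date);
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        List<LocalDate> dates = Arrays.asList(
            LocalDate.of(2017, 12, 28), LocalDate.of(2017, 12, 29),
            LocalDate.of(2018, 1, 2), LocalDate.of(2018, 1, 3));
        // Print each non-empty window and the trading dates it holds.
        assign(dates).forEach((start, contents) ->
            System.out.println("[" + start + ", " + start.plusDays(SIZE_DAYS) + ") -> " + contents));
    }
}
```

In this sketch the windows starting on 12/29 and 01/01 each hold a single entry, and no window ever pairs 12/29 with 01/02. Windows starting on 12/30 and 12/31 receive nothing at all. As far as I know, Flink only creates a window when an element is assigned to it, so the all-NULL rows above would most likely never be emitted.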

How do I get my Flink stream to ignore the missing dates (and instead window or join or map together consecutive non-missing dates) so that my stream would look like this instead:

[technical1, GOOG, 12/26/2017; technical2, GOOG, 12/27/2017]
[technical1, MSFT, 12/26/2017; technical2, MSFT, 12/27/2017]
[technical2, GOOG, 12/27/2017; technical3, GOOG, 12/28/2017]
[technical2, MSFT, 12/27/2017; technical3, MSFT, 12/28/2017]
[technical3, GOOG, 12/28/2017; technical4, GOOG, 12/29/2017]
[technical3, MSFT, 12/28/2017; technical4, MSFT, 12/29/2017]
[technical4, GOOG, 12/29/2017; technical5, GOOG, 01/02/2018]
[technical4, MSFT, 12/29/2017; technical5, MSFT, 01/02/2018]
[technical5, GOOG, 01/02/2018; technical6, GOOG, 01/03/2018]
[technical5, MSFT, 01/02/2018; technical6, MSFT, 01/03/2018]
[. . .]
[technicalN, GOOG, 04/01/2018; technicalN+1, GOOG, 04/02/2018]
[technicalN, MSFT, 04/01/2018; technicalN+1, MSFT, 04/02/2018]

?

(Note: please ignore the way that I'm appending an incrementing number to the string "technical" (like technical1, technical2, etc.) because, as I mentioned already, that value is just for descriptive purposes in this post and doesn't actually exist in the data. The only way to determine if two trading entries are consecutive is by grouping them by ticker and ordering them by trading date. Let's assume that no duplicate events exist.)
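
The grouping-and-ordering rule in the note can be sketched in plain Java (a batch-style illustration, not a Flink job). `Entry` is a hypothetical, trimmed-down stand-in for `TechnicalDataEntry`, and `ConsecutivePairs.pairs` is a made-up helper. The sketch only shows what "consecutive" means here, under the stated assumption that there are no duplicates.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ConsecutivePairs {
    // Hypothetical stand-in for TechnicalDataEntry: only ticker and trading date.
    static final class Entry {
        final String ticker;
        final LocalDate date;
        Entry(String ticker, LocalDate date) { this.ticker = ticker; this.date = date; }
        @Override public String toString() { return ticker + " " + date; }
    }

    // For each ticker, pairs every entry with the next entry by trading date,
    // regardless of how many calendar days lie between them.
    static Map<String, List<List<Entry>>> pairs(List<Entry> entries) {
        // Group by ticker.
        Map<String, List<Entry>> byTicker = new TreeMap<>();
        for (Entry e : entries) {
            byTicker.computeIfAbsent(e.ticker, k -> new ArrayList<>()).add(e);
        }
        Map<String, List<List<Entry>>> result = new TreeMap<>();
        for (Map.Entry<String, List<Entry>> group : byTicker.entrySet()) {
            List<Entry> sorted = group.getValue();
            // Order by trading date.
            sorted.sort(Comparator.comparing((Entry e) -> e.date));
            List<List<Entry>> paired = new ArrayList<>();
            for (int i = 0; i + 1 < sorted.size(); i++) {
                paired.add(Arrays.asList(sorted.get(i), sorted.get(i + 1)));
            }
            result.put(group.getKey(), paired);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("GOOG", LocalDate.of(2017, 12, 29)),
            new Entry("GOOG", LocalDate.of(2018, 1, 2)),
            new Entry("MSFT", LocalDate.of(2017, 12, 29)),
            new Entry("GOOG", LocalDate.of(2017, 12, 28)),
            new Entry("MSFT", LocalDate.of(2018, 1, 2)));
        pairs(entries).forEach((ticker, p) -> System.out.println(ticker + ": " + p));
    }
}
```

Here 12/29 is paired directly with 01/02 because pairing depends on position in the per-ticker ordering, not on calendar distance.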

If I understand correctly, your issue is that there are certain periods when you're not receiving events, so the windows won't behave properly because they don't know about the passage of time.

One option you have is to periodically emit a watermark, like so:

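import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.watermark.Watermark;

streamEnvironment.addSource(new SourceFunction<Object>() {
        @Override
        public void run(final SourceContext<Object> ctx) {
            (...)

            // Tell downstream operators that event time has advanced to `timestamp`,
            // even if no events arrived in the meantime.
            ctx.emitWatermark(new Watermark(timestamp));
        }

        @Override
        public void cancel() {
            // Nothing to clean up in this sketch.
        }
    })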

Keep in mind that events arriving after the watermark has already passed their timestamp are considered late and are ignored by default, so the periodicity of your watermark emission is a trade-off between "window accuracy" (windows firing as soon as they can) and tolerance for late events.
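
As a rough illustration of that trade-off, the following plain-Java sketch encodes my understanding of when an event-time window stops accepting elements: once the watermark reaches the window's last timestamp plus any allowed lateness. The class `LatenessCheck` and its method are hypothetical names, not Flink API, and the sketch ignores overflow handling.

```java
// Illustrative only: models the lateness rule, it is not Flink code.
class LatenessCheck {
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    // True if an element assigned to the window ending at windowEndMillis (exclusive)
    // would be dropped as late, given the current watermark.
    static boolean isDropped(long windowEndMillis, long allowedLatenessMillis, long currentWatermark) {
        long maxTimestamp = windowEndMillis - 1; // last timestamp that belongs to the window
        return maxTimestamp + allowedLatenessMillis <= currentWatermark;
    }

    public static void main(String[] args) {
        long windowEnd = 2 * DAY_MILLIS; // window [0, 2 days)
        // Watermark has reached the end of the window: a late element is dropped.
        System.out.println(isDropped(windowEnd, 0, windowEnd - 1));
        // Watermark is still one millisecond short: the element is accepted.
        System.out.println(isDropped(windowEnd, 0, windowEnd - 2));
        // One day of allowed lateness keeps the window open longer.
        System.out.println(isDropped(windowEnd, DAY_MILLIS, windowEnd - 1));
    }
}
```

Emitting watermarks aggressively advances `currentWatermark` sooner, so windows fire earlier but more stragglers fall on the dropped side of this check. Emitting them conservatively does the opposite.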


 