
How to use historical dataset for enriching Flink DataStream

I am working on a real-time project with Flink, and I need to enrich the state of each card with its prior transactions in order to compute transaction features, as described below:

For each card I have a feature that counts the number of transactions in the past 24 hours. I also have two data sources:

First, a database table that stores the cards' transactions up to the end of yesterday.

Second, the stream of today's transactions.

So the first step is to fetch yesterday's transactions for each card from the database and store them in the card's state. The second step is to update this state with today's transactions as they arrive on the stream, and to compute the number of transactions in the past 24 hours for each of them. I tried to read the database data as a stream and connect it to today's transactions, using a RichCoFlatMapFunction to reach the above goal. However, because the database data was not inherently a stream, the output was not correct. The function is as follows:

// Both inputs must be keyed by card upstream: keyed state such as
// ValueState is only available on keyed streams.
transactionsHistory.connect(transactionsStream).flatMap(
        new RichCoFlatMapFunction<History, Tuple2<String, Transaction>, ExtractedFeatures>() {
    private ValueState<History> history;

    @Override
    public void open(Configuration config) throws Exception {
        this.history = getRuntimeContext().getState(
                new ValueStateDescriptor<>("card history", History.class));
    }

    // Historical data
    @Override
    public void flatMap1(History history,
                         Collector<ExtractedFeatures> collector) throws Exception {
        this.history.update(history);
    }

    // New transactions from the stream
    @Override
    public void flatMap2(Tuple2<String, Transaction> transactionTuple,
                         Collector<ExtractedFeatures> collector) throws Exception {
        // value() returns null if no history has arrived yet for this card.
        History history = this.history.value();
        Transaction transaction = transactionTuple.f1;
        ArrayList<History> prevDayHistoryList = history.prevDayTransactions;

        // This function returns the transactions that fall within the 24-hour
        // window of the current transaction, together with their count.
        Tuple2<ArrayList<History>, Integer> prevDayHistoryTuple =
                findHistoricalDate(prevDayHistoryList, transaction.transactionLocalDate);
        prevDayHistoryList = prevDayHistoryTuple.f0;
        history.prevDayTransactions = prevDayHistoryList;
        this.history.update(history);

        ExtractedFeatures ef = new ExtractedFeatures();
        ef.updateFeatures(transaction, prevDayHistoryTuple.f1);
        collector.collect(ef);
    }
});

What is the right design pattern to achieve the above enrichment requirement in a Flink streaming program? I found the question below on Stack Overflow, which is similar to mine, but I couldn't solve my problem with it, so I decided to ask for help :)

Enriching DataStream using static DataSet in Flink streaming

Any help would be really appreciated.

"However, because the database data was not inherently a stream, the output was not correct."

It certainly is possible to enrich streaming data with information coming from a relational database. What can be tricky, though, is guaranteeing that the enrichment data is ingested before it is needed. In general, you may need to buffer the stream to be enriched until the enrichment data has been bootstrapped/ingested. One approach that is sometimes taken, for example, is to:

  1. run the app with the stream-to-be-enriched disabled
  2. take a savepoint once the enrichment data has been fully ingested and stored in Flink state
  3. restart the app from the savepoint with the stream-to-be-enriched enabled

In the case you describe, however, it seems like a simpler approach would work. If you only need 24 hours of historic data, then why not ignore the database of historic transactions? Just run your application until it has seen 24 hours of streaming data, after which the historic database becomes irrelevant anyway.
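To see why the database stops mattering, here is a minimal Flink-free sketch of a per-card trailing 24-hour count. The class name `RollingDayCounter` and its method are hypothetical, and it assumes timestamps arrive in non-decreasing order and that the feature counts only transactions before the current one. In a real job this state would live in Flink keyed state rather than in a plain field.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: counts one card's transactions in the trailing 24 hours. */
class RollingDayCounter {
    static final long WINDOW_MS = 24L * 60 * 60 * 1000;

    // Timestamps (epoch millis) still inside the window, oldest first.
    private final Deque<Long> timestamps = new ArrayDeque<>();

    /**
     * Records a transaction and returns how many earlier transactions fall
     * within the 24 hours before it. Assumes non-decreasing timestamps.
     */
    int countPriorAndAdd(long timestampMs) {
        // Evict everything that is 24 hours old or older relative to this event.
        while (!timestamps.isEmpty()
                && timestamps.peekFirst() <= timestampMs - WINDOW_MS) {
            timestamps.pollFirst();
        }
        int prior = timestamps.size();
        timestamps.addLast(timestampMs);
        return prior;
    }
}
```

Because every entry is evicted once it is 24 hours old, anything that would have been loaded from the database has left the window after the job has run for a day, and the counts then depend only on streamed transactions.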

But if you must ingest the historic data, and you don't like the savepoint-based approach outlined above, here are a couple of other possibilities:

  • buffer the un-enriched events in Flink state (e.g. ListState or MapState) until the historic stream has been ingested
  • write a custom SourceFunction that blocks the primary stream until the historic data has been ingested
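The first of these options can be modelled without Flink as a rough sketch of the per-card logic. The class `BufferingEnricher`, its method names, and the placeholder `enrich` step are all hypothetical. In an actual job the two fields would be keyed state (for example a ValueState for the history and a ListState for the buffer), and the two methods would be the two callbacks of a function applied to the connected, keyed streams.

```java
import java.util.ArrayList;
import java.util.List;

/** Flink-free model of buffering one card's events until its history arrives. */
class BufferingEnricher {
    // null until the historic record for this card has been ingested.
    private List<Long> history;
    // Un-enriched transaction timestamps waiting for the history.
    private final List<Long> pending = new ArrayList<>();

    /** Historic side: store the history, then emit everything that was waiting. */
    List<String> onHistory(List<Long> historicTimestamps) {
        history = new ArrayList<>(historicTimestamps);
        List<String> out = new ArrayList<>();
        for (long ts : pending) {
            out.add(enrich(ts));
        }
        pending.clear();
        return out;
    }

    /** Live side: enrich at once if the history is loaded, otherwise buffer. */
    List<String> onTransaction(long timestampMs) {
        List<String> out = new ArrayList<>();
        if (history == null) {
            pending.add(timestampMs);
        } else {
            out.add(enrich(timestampMs));
        }
        return out;
    }

    // Placeholder enrichment: "timestamp:number of transactions known before it".
    private String enrich(long timestampMs) {
        int prior = history.size();
        history.add(timestampMs);
        return timestampMs + ":" + prior;
    }
}
```

One caveat with this sketch: a card that has no record in the database would buffer forever, so a real job would probably also need some end-of-history signal or timeout to release those events.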

For a more thorough exploration of this topic, see Bootstrapping State In Apache Flink.

Better support for this use case is planned for a future release, btw.
