
How to use a historical dataset to enrich a Flink DataStream

I am working on a real-time Flink project and need to enrich the state of each card with its prior transactions in order to compute transaction features, as described below:

For each card, I have a feature that counts the number of transactions in the past 24 hours. I have two data sources:

First, a database table that stores the transactions of each card up to the end of yesterday.

Second, the stream of today's transactions.

So the first step is to fetch yesterday's transactions for each card from the database and store them in card state. The second step is to update this state with today's transactions as they arrive on the stream, and to compute the number of transactions in the past 24 hours for each of them. I tried to read the database data as a stream and connect it to the stream of today's transactions, using a RichCoFlatMapFunction. However, because the database data is not inherently a stream, the output was not correct. The function is as follows:

transactionsHistory.connect(transactionsStream).flatMap(
    new RichCoFlatMapFunction<History, Tuple2<String, Transaction>, ExtractedFeatures>() {

        private ValueState<History> history;

        @Override
        public void open(Configuration config) throws Exception {
            this.history = getRuntimeContext().getState(
                new ValueStateDescriptor<>("card history", History.class));
        }

        // historical data
        @Override
        public void flatMap1(History history,
                Collector<ExtractedFeatures> collector) throws Exception {
            this.history.update(history);
        }

        // new transactions from the stream
        @Override
        public void flatMap2(Tuple2<String, Transaction> transactionTuple,
                Collector<ExtractedFeatures> collector) throws Exception {
            History history = this.history.value();
            if (history == null) {
                // guard against an NPE: the historic record for this card
                // may not have been ingested yet when a live event arrives
                return;
            }
            Transaction transaction = transactionTuple.f1;
            ArrayList<History> prevDayHistoryList = history.prevDayTransactions;

            // This helper returns the transactions that fall inside the
            // 24-hour window ending at the current transaction, plus their count.
            Tuple2<ArrayList<History>, Integer> prevDayHistoryTuple =
                findHistoricalDate(prevDayHistoryList, transaction.transactionLocalDate);
            prevDayHistoryList = prevDayHistoryTuple.f0;
            history.prevDayTransactions = prevDayHistoryList;
            this.history.update(history);

            ExtractedFeatures ef = new ExtractedFeatures();
            ef.updateFeatures(transaction, prevDayHistoryTuple.f1);
            collector.collect(ef);
        }
    });
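For reference, the window-pruning step that findHistoricalDate performs might look like the sketch below. This is an assumption, not the original helper (which is not shown in the question): the simplified History class, its transactionLocalDate field, and the exact inclusive/exclusive boundary handling are all hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;

public class HistoryPruner {
    /** Minimal stand-in for the History records kept in state
     *  (hypothetical: the real class is not shown in the question). */
    public static class History {
        public final LocalDateTime transactionLocalDate;
        public History(LocalDateTime t) { this.transactionLocalDate = t; }
    }

    /**
     * Keeps only the transactions inside the 24-hour window ending at
     * 'current'; the feature value is then simply the size of the
     * returned list, mirroring Tuple2&lt;ArrayList&lt;History&gt;, Integer&gt;.
     */
    public static ArrayList<History> pruneToLastDay(
            ArrayList<History> prev, LocalDateTime current) {
        LocalDateTime cutoff = current.minus(Duration.ofHours(24));
        ArrayList<History> kept = new ArrayList<>();
        for (History h : prev) {
            // keep transactions in (current - 24h, current]
            if (!h.transactionLocalDate.isBefore(cutoff)
                    && !h.transactionLocalDate.isAfter(current)) {
                kept.add(h);
            }
        }
        return kept;
    }
}
```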

What is the right design pattern to achieve this enrichment requirement in a Flink streaming program? I found the question below on Stack Overflow, which is similar to mine, but it didn't solve my problem, so I decided to ask for help :)

Enriching DataStream using static DataSet in Flink streaming

Any help would be really appreciated.

However, because the database data was not stream inherently, the output was not correct.

It certainly is possible to enrich streaming data with information coming from a relational database. What can be tricky, though, is to somehow guarantee that the enrichment data is ingested before it is needed. In general you may need to buffer the stream to be enriched until the enrichment data has been bootstrapped/ingested. One approach that is sometimes taken, for example, is to

  1. run the app with the stream-to-be-enriched disabled
  2. take a savepoint once the enrichment data has been fully ingested and stored in Flink state
  3. restart the app from the savepoint with the stream-to-be-enriched enabled

In the case you describe, however, it seems like a simpler approach would work. If you only need 24 hours of historic data, then why not ignore the database of historic transactions? Just run your application until it has seen 24 hours of streaming data, after which the historic database becomes irrelevant anyway.

But if you must ingest the historic data, and you don't like the savepoint-based approach outlined above, here are a couple of other possibilities:

  • buffer the un-enriched events in Flink state (e.g. ListState or MapState) until the historic stream has been ingested
  • write a custom SourceFunction that blocks the primary stream until the historic data has been ingested
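The first of these options can be sketched without any Flink dependencies. Below is a plain-Java stand-in for the per-key logic a connected two-input function would implement; in a real job, `buffer` would live in ListState, `historyLoaded` in ValueState, and the end-of-bootstrap signal would come from watermarks or a sentinel record. All class and method names here are illustrative, not Flink API.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the buffer-until-bootstrapped pattern: live events are
 *  parked until the historic side signals completion, then flushed. */
public class BufferingEnricher {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> emitted = new ArrayList<>();
    private boolean historyLoaded = false;

    /** Called for each element of the historic (database) stream.
     *  'isLast' stands in for an end-of-bootstrap signal. */
    public void onHistory(String record, boolean isLast) {
        // ...update the card's history state with 'record' here...
        if (isLast) {
            historyLoaded = true;
            for (String e : buffer) {
                emitted.add(enrich(e));   // flush parked events in order
            }
            buffer.clear();
        }
    }

    /** Called for each live transaction. */
    public void onTransaction(String tx) {
        if (!historyLoaded) {
            buffer.add(tx);               // park until bootstrap completes
        } else {
            emitted.add(enrich(tx));
        }
    }

    /** Placeholder for the real feature computation. */
    private String enrich(String tx) { return tx + ":enriched"; }

    public List<String> results() { return emitted; }
}
```

The key property is ordering: an event that arrives before its card's history has been loaded is never enriched against empty state; it simply waits.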

For a more thorough exploration of this topic, see Bootstrapping State In Apache Flink.

Better support for this use case is planned for a future release, btw.
