
Enriching one stream with another stream

I want to use Apache Flink for the following. I have one main stream that has to be enriched with data from another stream. The main stream has elements with the attributes "site" and "timestamp". The other stream (let's call it countrystream) has the attributes "site" and "country". The countrystream should keep track of the latest country seen for a site. For example, if ("klm.com", "netherlands") arrived first and some time later the tuple ("klm.com", "france") arrived, then "klm.com" should point to "france" (since it arrived later). So it has to maintain state. Suppose a tuple ("klm.com", 100) arrives on the main stream. It should now be enriched to ("klm.com", 100, "france"). If a site is not found in the countrystream, it should be enriched with "?", for example ("stackoverflow.com", 150, "?"). How can I achieve this?
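To make the intended semantics concrete, here is a minimal plain-Scala model of the behaviour described above. It does not use Flink at all; it only replays events in arrival order. The names `Event`, `CountryUpdate`, `Visit` and `EnrichmentModel.enrich` are hypothetical and exist only for this illustration.

```scala
// Plain-Scala model of the desired enrichment semantics (no Flink involved).
// All type and method names here are hypothetical, chosen for illustration.
sealed trait Event
final case class CountryUpdate(site: String, country: String) extends Event
final case class Visit(site: String, timestamp: Long) extends Event

object EnrichmentModel {
  // Walk the events in arrival order, remembering the latest country per site.
  def enrich(events: Seq[Event]): Vector[(String, Long, String)] = {
    val initial = (Map.empty[String, String], Vector.empty[(String, Long, String)])
    val (_, enriched) = events.foldLeft(initial) {
      case ((latest, acc), CountryUpdate(site, country)) =>
        // A newer country for the same site overwrites the older one.
        (latest.updated(site, country), acc)
      case ((latest, acc), Visit(site, ts)) =>
        // Unknown sites are enriched with "?".
        (latest, acc :+ ((site, ts, latest.getOrElse(site, "?"))))
    }
    enriched
  }
}
```

For the events from the question, in the order ("klm.com", "netherlands"), ("klm.com", "france"), ("klm.com", 100), ("stackoverflow.com", 150), `EnrichmentModel.enrich` returns ("klm.com", 100, "france") followed by ("stackoverflow.com", 150, "?").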

I found a solution (it took me some time). Is this efficient? Can it be improved? Does it mean that I cannot have checkpoints for my iterative stream?

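import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.getExecutionEnvironment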

val mainStream = env.fromElements("a", "a", "b", "a", "a", "b", "b", "a", "c", "b", "a", "c")
val infoStream = env.fromElements((1, "a", "It is F"), (2, "b", "It is B"), (3, "c", "It is C"), (4, "a", "Whoops, it is A"))
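        .iterate(
            iteration => {
                // Feed every record back into the loop and also emit it downstream,
                // so the info records keep circulating and show up in later windows.
                (iteration, iteration)
            }
        )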

mainStream
    .coGroup(infoStream)
        .where[String]((x: String) => x)
        .equalTo(_._2)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(1))) {
            (first: Iterator[String], second: Iterator[(Int, String, String)], out: Collector[(String, String)]) => {
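                // An Iterator can only be traversed once, so materialize the info
                // records before looking them up for each main-stream element.
                val infoRecords = second.toList
                first.foreach((key: String) => {
                        val matchingRecords = infoRecords
                            .filter(_._2 == key)
                        if (matchingRecords.nonEmpty) {
                            // Keep the record with the largest first field (the latest one in this example).
                            val matchingRecord = matchingRecords.maxBy(_._1)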
                            out.collect((matchingRecord._2, matchingRecord._3))
                        }
                    }
                )
            }
        }
    .print()

env.execute("proof_of_concept")

