Process 2 streams in sync using Flink

I have 2 streams, A and B.

I started consuming both A and B.

Stream A gets a record only at the 59th second of every minute.

Stream B gets a record at any second of the minute.

I want to process them so that both streams stay in sync.

Example: after 10:01:59, Stream A will next deliver a record at 10:02:59; until 10:02:59 I don't want to read anything from Stream B either.

Can this be achieved in Flink?

You can't choose not to read records from a stream in Flink, but you can drop records from a stream (or save them in state). So you could connect the two streams and process them with a CoFlatMap. When you get a record from Stream A, save it in state. When you get a record from Stream B, decide what to do with it based on the state of Stream A.
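A minimal sketch of that idea, assuming the Flink 1.x Scala API used elsewhere on this page. The class name `GateBOnLatestA`, the element shapes (Stream A as a `Long` timestamp, Stream B as a `(Long, String)` pair of timestamp and payload) and the gating rule itself are hypothetical stand-ins, not something stated in the question:

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

// Assumed shapes: A = timestamp, B = (timestamp, payload).
class GateBOnLatestA
    extends RichCoFlatMapFunction[Long, (Long, String), (Long, String)] {

  // Timestamp of the most recent A record; null until the first A arrives.
  @transient private var latestA: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    latestA = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("latest-a", classOf[java.lang.Long])
    )
  }

  // Stream A: remember the record in state, emit nothing.
  override def flatMap1(aTimestamp: Long, out: Collector[(Long, String)]): Unit =
    latestA.update(aTimestamp)

  // Stream B: forward the record only if it is not ahead of the latest A;
  // otherwise it is dropped (an illustrative rule, buffering is the alternative).
  override def flatMap2(b: (Long, String), out: Collector[(Long, String)]): Unit = {
    val seen = latestA.value()
    if (seen != null && b._1 <= seen.longValue()) out.collect(b)
  }
}
```

Given a `streamA: DataStream[Long]` and a `streamB: DataStream[(Long, String)]`, this could be wired as `streamA.connect(streamB).keyBy(_ => 0, _ => 0).flatMap(new GateBOnLatestA)`. The constant key is only there because `ValueState` is keyed state; a real job would more likely key both streams by a shared field.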

Flink uses a push-based model to process elements downstream (this should change soon, as sources and sinks are being refactored to a pull-based model). This means that you cannot "wait until an event has arrived to pull in more data"; you must buffer that data in some operator state in the meantime. Flink offers various state backends for you to use.
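To make the buffering idea concrete without any Flink dependency, here is a small plain-Scala simulation. It is only a sketch: `HoldBack`, `A`, `B` and the timestamps are made-up names and values, and the in-memory buffer stands in for what would be operator state in a real job.

```scala
import scala.collection.mutable

// Hypothetical event types for the simulation.
sealed trait Event
final case class A(timestamp: Long) extends Event
final case class B(timestamp: Long, payload: String) extends Event

// Mimics an operator: events are pushed in one at a time, and B records
// are held in a buffer until an A record arrives.
final class HoldBack {
  private val buffer = mutable.ArrayBuffer.empty[B]

  // Called once per pushed event; returns the B records released by it.
  def onEvent(event: Event): List[B] = event match {
    case b: B =>
      buffer += b // cannot refuse the record, so keep it for later
      Nil
    case A(ts) =>
      // Release everything at or before the A timestamp, keep the rest.
      val (ready, pending) = buffer.toList.partition(_.timestamp <= ts)
      buffer.clear()
      buffer ++= pending
      ready
  }
}

object HoldBackDemo {
  def main(args: Array[String]): Unit = {
    val op = new HoldBack
    val events = List(B(5L, "b1"), B(30L, "b2"), A(59L), B(70L, "b3"), A(119L))
    events.foreach(e => println(s"$e released ${op.onEvent(e)}"))
  }
}
```

In this run the two early B records are released only when `A(59)` arrives, and `B(70, "b3")` waits for `A(119)`.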

To give some visualization of kkrugler's answer: given two streams, this is how we'd logically connect them and then use a ListState for one of them, to be retrieved when an element from the other stream arrives:

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

import scala.collection.JavaConverters._ 

object Test {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.createLocalEnvironment()
    val streamA = env.fromCollection(List(1, 2, 3))
    val streamB = env.fromCollection(List("a", "b", "c"))

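    streamA
      .connect(streamB)
      // ListState is keyed state, so the connected streams must be keyed;
      // a constant key routes every record to the same state partition.
      .keyBy(_ => 0, _ => 0)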
      .process {
        new CoProcessFunction[Int, String, (Int, String)] {
          var myStateA: ListState[Int] = _

          override def open(parameters: Configuration): Unit = {
            myStateA = getRuntimeContext.getListState[Int](
              new ListStateDescriptor[Int]("my_state", classOf[Int])
            )
          }

          override def processElement1(
              value: Int,
              ctx: CoProcessFunction[Int, String, (Int, String)]#Context,
              out: Collector[(Int, String)]
          ): Unit = {
            myStateA.add(value)
          }

          override def processElement2(
              value: String,
              ctx: CoProcessFunction[Int, String, (Int, String)]#Context,
              out: Collector[(Int, String)]
          ): Unit = {
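            val list = myStateA.get().iterator().asScala.toList
            list match {
              case myInt :: rest =>
                // Pair the oldest buffered A element with this B element
                // and keep the remaining A elements in state.
                out.collect((myInt, value))
                myStateA.update(rest.asJava)
              case Nil =>
                // No A element buffered yet: nothing to emit, state unchanged.
                ()
            }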
          }
        }
      }
      .print() // a sink is needed, otherwise the job has nothing to run

    env.execute()
  }
}

Note that this implementation is simplified. There is no guarantee here about the order in which elements arrive; you'd need to handle that in your state and implementation. You could additionally use Timers, registering a timer for each event that enters the stream as an indication of when new data has arrived.
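As a rough sketch of the timer idea, assuming the Flink 1.x Scala API and that event timestamps and watermarks are already assigned upstream on both streams: each A record registers an event-time timer, and when it fires, the buffered B records up to that timestamp are released in timestamp order. The class name `TimedRelease` and the element shapes (A as a `Long` timestamp, B as `(Long, String)`) are hypothetical.

```scala
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

import scala.collection.JavaConverters._

// Assumed shapes: key = Int, A = timestamp, B = (timestamp, payload).
class TimedRelease
    extends KeyedCoProcessFunction[Int, Long, (Long, String), (Long, String)] {

  // B records waiting for an A timer to fire.
  @transient private var bufferedB: ListState[(Long, String)] = _

  override def open(parameters: Configuration): Unit = {
    bufferedB = getRuntimeContext.getListState(
      new ListStateDescriptor[(Long, String)](
        "buffered-b",
        createTypeInformation[(Long, String)]
      )
    )
  }

  // Stream A: register an event-time timer at the record's timestamp.
  override def processElement1(
      aTimestamp: Long,
      ctx: KeyedCoProcessFunction[Int, Long, (Long, String), (Long, String)]#Context,
      out: Collector[(Long, String)]
  ): Unit =
    ctx.timerService().registerEventTimeTimer(aTimestamp)

  // Stream B: buffer the record instead of emitting it.
  override def processElement2(
      b: (Long, String),
      ctx: KeyedCoProcessFunction[Int, Long, (Long, String), (Long, String)]#Context,
      out: Collector[(Long, String)]
  ): Unit =
    bufferedB.add(b)

  // Fires once the watermark passes an A timestamp: emit the B records
  // up to that timestamp, sorted, and keep the later ones buffered.
  override def onTimer(
      timestamp: Long,
      ctx: KeyedCoProcessFunction[Int, Long, (Long, String), (Long, String)]#OnTimerContext,
      out: Collector[(Long, String)]
  ): Unit = {
    val (ready, pending) =
      bufferedB.get().asScala.toList.partition(_._1 <= timestamp)
    ready.sortBy(_._1).foreach(b => out.collect(b))
    bufferedB.update(pending.asJava)
  }
}
```

It would be applied to keyed connected streams, for example `streamA.connect(streamB).keyBy(_ => 0, _ => 0).process(new TimedRelease)`. Sorting at release time is one way to handle the arrival-order caveat above; records arriving later than the watermark allows would still need separate handling.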


 