
Structured Streaming - Consume each message

What would be the "recommended" way to process each message as it comes through a Structured Streaming pipeline (I'm on Spark 2.1.1, with the source being Kafka 0.10.2.1)?
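For context, this is roughly what the source side looks like in Spark 2.1 with the spark-sql-kafka-0-10 connector; the broker and topic names below are placeholders, not from the original post:

// Sketch: read the Kafka topic as a streaming DataFrame.
// "broker1:9092" and "events" are placeholder values.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")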

So far, I am looking at dataframe.mapPartitions (since I need to connect to HBase, whose client connection classes are not serializable, hence mapPartitions).
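For reference, a minimal sketch of that mapPartitions route, assuming hypothetical createHBaseClient() and writeToHBase() helpers (neither is from the original question); the partition is materialized before close() so the client is not shut down while the iterator is still lazy:

// Sketch: create the non-serializable client inside the partition
// function so it never has to be shipped from the driver.
// createHBaseClient() and writeToHBase() are hypothetical placeholders.
val processed = ds.mapPartitions { records =>
  val client = createHBaseClient()   // built on the executor, once per partition
  try {
    records.map { record =>
      writeToHBase(client, record)   // hypothetical per-record write
      record
    }.toList.iterator                // force evaluation before close()
  } finally {
    client.close()
  }
}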

Ideas?

You should be able to use a foreach output sink: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks and https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach

Even though the client is not serializable, you don't have to open it in your ForeachWriter constructor. Just leave it None/null, and initialize it in the open method, which is called after serialization, but only once per task.

In sort-of-pseudo-code:

class HBaseForeachWriter extends ForeachWriter[MyType] {
  var client: Option[HBaseClient] = None

  def open(partitionId: Long, version: Long): Boolean = {
    // Called on the executor after deserialization, once per task.
    client = Some(... open a client ...)
    true  // returning true tells Spark to process this partition
  }

  def process(record: MyType): Unit = {
    client match {
      case None => throw new IllegalStateException("shouldn't happen: open() sets the client")
      case Some(cl) =>
        ... use cl to write record ...
    }
  }

  def close(errorOrNull: Throwable): Unit = {
    // Called when the task finishes (or fails); release the connection.
    client.foreach(cl => cl.close())
  }
}
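Wiring the writer into the query is then a one-liner on the stream; a sketch, assuming ds is a Dataset[MyType] coming off the Kafka source (the name is not from the original answer):

// Attach the writer as the sink of the streaming query.
val query = ds
  .writeStream
  .foreach(new HBaseForeachWriter)
  .outputMode("append")
  .start()

query.awaitTermination()   // block until the query stops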
