Structured Streaming - Consume each message
What would be the "recommended" way to process each message as it comes through a Structured Streaming pipeline? (I'm on Spark 2.1.1, with Kafka 0.10.2.1 as the source.)
So far, I am looking at dataframe.mapPartitions (since I need to connect to HBase, whose client connection classes are not serializable, hence mapPartitions).
Ideas?
You should be able to use a foreach output sink: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks and https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach
Even though the client is not serializable, you don't have to open it in your ForeachWriter constructor. Just leave it None/null, and initialize it in the open method, which is called after the writer has been serialized and shipped to the executor, but only once per task.
In sort-of-pseudo-code:
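import org.apache.spark.sql.ForeachWriter

// MyType and HBaseClient are placeholders for your record type and HBase client.
class HBaseForeachWriter extends ForeachWriter[MyType] {
  // Left empty at construction time so the writer itself stays serializable.
  var client: Option[HBaseClient] = None

  // Runs on the executor, after the writer has been deserialized.
  def open(partitionId: Long, version: Long): Boolean = {
    client = Some(... open a client ...)
    true // returning true tells Spark to go on and process this partition
  }

  def process(record: MyType): Unit = {
    client match {
      case None => throw new Exception("shouldn't happen")
      case Some(cl) => {
        ... use cl to write record ...
      }
    }
  }

  def close(errorOrNull: Throwable): Unit = {
    client.foreach(cl => cl.close())
  }
}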
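
Why leaving the client empty until open helps can be checked without Spark or HBase. The following is only a minimal sketch under stated assumptions: FakeClient, LazyWriter and SerializationDemo are hypothetical names invented for this illustration, and plain Java serialization stands in for the way a writer gets shipped to an executor. It mirrors the open/process/close life cycle above and shows that the writer serializes fine while its client is still None, but not once it holds a live client.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a client that cannot be serialized
// (like an HBase connection). Deliberately NOT Serializable.
class FakeClient {
  def write(record: String): Unit = println(s"wrote $record")
  def close(): Unit = ()
}

// Hypothetical writer with the same life cycle as the ForeachWriter above,
// but with no Spark dependency.
class LazyWriter extends Serializable {
  // None is serializable, so the writer is too, as long as it is not open.
  private var client: Option[FakeClient] = None

  def open(): Boolean = { client = Some(new FakeClient); true }

  def process(record: String): Unit = client match {
    case Some(cl) => cl.write(record)
    case _        => throw new IllegalStateException("open() was not called")
  }

  def close(): Unit = { client.foreach(_.close()); client = None }

  def isOpen: Boolean = client.isDefined
}

object SerializationDemo {
  // Java-serialize and deserialize an object in memory.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Calling SerializationDemo.roundTrip(new LazyWriter) succeeds, and the copy can then be opened and used. Calling roundTrip on a writer that has already been opened throws java.io.NotSerializableException, which is the failure you would get from creating the client in the constructor.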