Structured Spark Streaming multiple writes
I have a streaming Dataset that I want to write to both a Kafka topic and HBase. For Kafka, I use a format like this:
dataset.selectExpr("id as key", "to_json(struct(*)) as value")
  .writeStream.format("kafka")
  .option("kafka.bootstrap.servers", Settings.KAFKA_URL)
  .option("topic", Settings.KAFKA_TOPIC2)
  .option("checkpointLocation", "/usr/local/Cellar/zookeepertmp")
  .outputMode(OutputMode.Complete())
  .start()
and then for HBase, I do something like this:
dataset.writeStream.outputMode(OutputMode.Complete())
  .foreach(new ForeachWriter[Row] {
    override def process(r: Row): Unit = {
      //my logic
    }
    override def close(errorOrNull: Throwable): Unit = {}
    override def open(partitionId: Long, version: Long): Boolean = {
      true
    }
  }).start().awaitTermination()
This writes to HBase as expected, but it doesn't always write to the Kafka topic. I am not sure why that is happening.
Use foreachBatch in Spark:
If you want to write the output of a streaming query to multiple locations, then you can simply write the output DataFrame/Dataset multiple times. However, each attempt to write can cause the output data to be recomputed (including possible re-reading of the input data). To avoid recomputations, you should cache the output DataFrame/Dataset, write it to multiple locations, and then uncache it. Here is an outline.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(…).save(…)  // location 1
  batchDF.write.format(…).save(…)  // location 2
  batchDF.unpersist()
}
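Applied to the pipeline in the question, a minimal sketch could look like the code below. It reuses the dataset, Settings.KAFKA_URL, Settings.KAFKA_TOPIC2, and checkpoint path from the question, and introduces a hypothetical saveToHBase helper standing in for the HBase logic from the original ForeachWriter:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.OutputMode

// Hypothetical helper wrapping the original HBase write logic ("//my logic").
def saveToHBase(batchDF: DataFrame): Unit = {
  // put the HBase mutations here
}

val query = dataset.writeStream
  .outputMode(OutputMode.Complete())
  .option("checkpointLocation", "/usr/local/Cellar/zookeepertmp")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()  // cache so both sinks reuse the same computed micro-batch

    // Sink 1: Kafka (batch write of the current micro-batch)
    batchDF.selectExpr("id as key", "to_json(struct(*)) as value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", Settings.KAFKA_URL)
      .option("topic", Settings.KAFKA_TOPIC2)
      .save()

    // Sink 2: HBase
    saveToHBase(batchDF)

    batchDF.unpersist()
  }
  .start()

query.awaitTermination()

One advantage of this approach is that a single streaming query with one checkpoint location drives both sinks, so the Kafka and HBase writes advance together instead of running as two independent queries.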