Spark Structured Streaming ForeachWriter and database performance

Question

I have implemented a structured stream like this:

myDataSet
  .map(r =>  StatementWrapper.Transform(r))
  .writeStream
  .foreach(MyWrapper.myWriter)
  .start()
  .awaitTermination()

This all seems to work, but the throughput of MyWrapper.myWriter is terrible. It is effectively trying to be a JDBC sink, and it looks like this:

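import java.sql.{Connection, SQLException, SQLSyntaxErrorException}
import scala.util.Try
import org.apache.spark.sql.ForeachWriter

val myWriter: ForeachWriter[Seq[String]] = new ForeachWriter[Seq[String]] {

  var connection: Connection = _

  // Returning false makes Spark skip the rows of this partition.
  override def open(partitionId: Long, version: Long): Boolean = {
    Try (connection = getRemoteConnection).isSuccess
  }

  // Each element of the row is a complete SQL statement, executed one at a time.
  override def process(row: Seq[String]): Unit = {
    val statement = connection.createStatement()
    try {
      row.foreach( s => statement.execute(s) )
    } catch {
      case e: SQLSyntaxErrorException => println(e)
      case e: SQLException => println(e)
    } finally {
      // close() releases the statement right away; closeOnCompletion() only
      // closes it once its dependent result sets are closed.
      statement.close()
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    // The connection is still null if open() failed.
    if (connection != null) connection.close()
  }
}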

So my question is: is a new ForeachWriter instantiated for every row, so that open() and close() are called for each row in the dataset?

Is there a better design that would improve throughput?

How can I parse the SQL statement once and execute it many times, while keeping the database connection open?

Answer 1

Whether the underlying sink is actually opened and closed depends on your implementation of ForeachWriter.

The class that invokes ForeachWriter is ForeachSink, and this is the code that calls your writer:

data.queryExecution.toRdd.foreachPartition { iter =>
  if (writer.open(TaskContext.getPartitionId(), batchId)) {
    try {
      while (iter.hasNext) {
        writer.process(encoder.fromRow(iter.next()))
      }
    } catch {
      case e: Throwable =>
        writer.close(e)
        throw e
    }
    writer.close(null)
  } else {
    writer.close(null)
  }
}

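Opening and closing the writer is attempted for each batch generated from the source. As the foreachPartition call above shows, that means once per partition of each batch, not once per row. If you want open and close to literally open and close the sink driver each time, you need to do that in your implementation.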
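Because open and close bracket a whole partition, one way to parse the SQL once and reuse it is to prepare the statement in open, queue rows in process, and flush them in batches. The following is only a minimal sketch under assumptions that are not in the question: the stream carries (id, payload) pairs instead of raw SQL strings, and the class name BatchingJdbcWriter, the events table, the JDBC URL parameter, and the batch size are all hypothetical.

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.ForeachWriter

// Hypothetical writer: rows are (id, payload) pairs, not raw SQL strings.
class BatchingJdbcWriter(url: String, val batchSize: Int = 500)
    extends ForeachWriter[(Long, String)] {

  // Created on the executor in open(), so no live connection is serialized.
  @transient private var connection: Connection = _
  @transient private var statement: PreparedStatement = _
  private var pending = 0

  override def open(partitionId: Long, version: Long): Boolean = {
    connection = DriverManager.getConnection(url)
    connection.setAutoCommit(false)
    // Prepared once per partition of a batch, then reused for every row.
    // The table and columns are assumptions for illustration.
    statement = connection.prepareStatement(
      "INSERT INTO events (id, payload) VALUES (?, ?)")
    pending = 0
    true
  }

  override def process(row: (Long, String)): Unit = {
    statement.setLong(1, row._1)
    statement.setString(2, row._2)
    statement.addBatch()
    pending += 1
    if (pending >= batchSize) flush()
  }

  override def close(errorOrNull: Throwable): Unit = {
    try {
      if (connection != null) {
        // Commit the whole partition on success, roll it back on failure.
        if (errorOrNull == null) { flush(); connection.commit() }
        else connection.rollback()
      }
    } finally {
      if (statement != null) statement.close()
      if (connection != null) connection.close()
    }
  }

  // Sends the queued rows to the database in one round trip.
  private def flush(): Unit = if (pending > 0) {
    statement.executeBatch()
    pending = 0
  }
}
```

With a Dataset[(Long, String)], this would be passed as .writeStream.foreach(new BatchingJdbcWriter(jdbcUrl)). Unlike the Try in the question, a failed connection here throws and fails the task rather than silently skipping the partition; which behavior is preferable depends on the job.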

If you want more control over how the data is handled, you can implement the Sink trait, which gives you the batch ID and the underlying DataFrame:

trait Sink {
  def addBatch(batchId: Long, data: DataFrame): Unit
}
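As a rough illustration of that route, a custom sink can be wired in through a StreamSinkProvider. This is a sketch only: Sink lives in Spark's internal org.apache.spark.sql.execution.streaming package, so it may change between versions, and the class names, the single string column, the events table, and the "url" option are assumptions made for the example. It iterates the batch the same way the ForeachSink snippet above does.

```scala
import java.sql.DriverManager
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Hypothetical sink: writes a batch whose first column is a string.
class JdbcBatchSink(url: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Local copy so the closure does not capture the non-serializable sink.
    val jdbcUrl = url
    // Same access path as ForeachSink: rows arrive as InternalRow.
    data.queryExecution.toRdd.foreachPartition { rows =>
      val connection = DriverManager.getConnection(jdbcUrl)
      try {
        // Table and column names are assumptions for illustration.
        val statement = connection.prepareStatement(
          "INSERT INTO events (payload) VALUES (?)")
        try {
          rows.foreach { row =>
            statement.setString(1, row.getUTF8String(0).toString)
            statement.addBatch()
          }
          statement.executeBatch()
        } finally statement.close()
      } finally connection.close()
    }
  }
}

// Lets the sink be selected with writeStream.format(<this class's full name>).
class JdbcBatchSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink =
    new JdbcBatchSink(parameters("url"))
}
```

The query would then use .writeStream.format with the provider's fully qualified class name and .option("url", jdbcUrl). batchId is unused here; a sink that needs to be idempotent on replay could record it alongside the data.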

10 2017-10-19 07:28:27
