How to collect all records at a Spark executor and process them as a batch

Question

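In my Spark Kinesis streaming application, I use foreachBatch to get the streaming data, and I need to send it to the Drools rule engine for further processing.

My requirement is to accumulate all the JSON data in a list or rule session and send it to the rule engine, so that it is processed as a batch on the executor side.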

//Scala Code Example:
    import org.apache.spark.sql.DataFrame

    val dataFrame = sparkSession.readStream
      .format("kinesis")
      .option("streamName", streamName)
      .option("region", region)
      .option("endpointUrl", endpointUrl)
      .option("initialPosition", "TRIM_HORIZON")
      .load()

    // Defined before the query so that it is initialized when passed to foreachBatch.
    val function: (DataFrame, Long) => Unit = (batchDF: DataFrame, batchId: Long) => {
      val ruleSession = kBase.newKieSession() // Drools rule session; this is created on the driver side.

      batchDF.foreach(row => { // This piece of code runs on the executors.
        val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
        ruleSession.insert(jsonData) // NullPointerException here, as the ruleSession is not available on the executor.
      })

      ruleHandler.processRule(ruleSession) // Again, this is in the driver scope.
    }

    val query = dataFrame
      .selectExpr("CAST(data as STRING) as krecord")
      .writeStream
      .foreachBatch(function)
      .start()

    query.awaitTermination()

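The problem I am facing in the code above is that the function passed to foreachBatch executes on the driver side, while the code inside batchDF.foreach executes on the worker/executor side, so it cannot get hold of the ruleSession.

Is there any way to run the whole function on each executor?

OR

Is there a better way to accumulate all the data in a batch DataFrame after transformation and send it to the next process from within the executor/worker?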

Answer 1

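I think this might work: rather than running foreach, you could use foreachBatch or foreachPartition (or a map version such as mapPartitions, if you want to return information). In that portion, open a connection to the Drools system. From there, iterate over the dataset within each partition (or batch), sending each record to the Drools system (or you might send the whole chunk to Drools). At the end of the foreachPartition / foreachBatch section, close the connection (if applicable).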
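A rough sketch of the per-partition pattern described above, kept free of Spark and Drools so it can run on its own. `RuleSession`, `processPartition`, and `CountingSession` are hypothetical names made up for this illustration; `RuleSession` only mirrors the `insert`, `fireAllRules`, and `dispose` calls that a Drools `KieSession` offers, and a real job would need an adapter around `kBase.newKieSession()` that can be created on the executor.

```scala
object PartitionBatching {
  // Hypothetical stand-in for the subset of a Drools KieSession used here.
  trait RuleSession {
    def insert(fact: Any): Unit
    def fireAllRules(): Int
    def dispose(): Unit
  }

  // Handles one partition: open one session, load every record,
  // fire the rules once for the whole chunk, then release the session.
  def processPartition(records: Iterator[String], newSession: () => RuleSession): Int = {
    val session = newSession()
    try {
      records.foreach(r => session.insert(r))
      session.fireAllRules()
    } finally {
      session.dispose()
    }
  }

  // In-memory session used only to exercise the sketch without Spark or Drools.
  class CountingSession extends RuleSession {
    private var facts = 0
    var disposed = false
    def insert(fact: Any): Unit = { facts += 1 }
    def fireAllRules(): Int = facts
    def dispose(): Unit = { disposed = true }
  }

  val session = new CountingSession
  val fired: Int = processPartition(Iterator("""{"id":1}""", """{"id":2}"""), () => session)

  // Inside foreachBatch, the same function would be applied per partition
  // (assumption: the session factory can be built on the executor):
  //   batchDF.foreachPartition { (rows: Iterator[Row]) =>
  //     processPartition(rows.map(_.getString(0)), sessionFactory)
  //   }
}
```

The point of the design is that the session is created inside the code that runs per partition, so nothing session-related has to be shipped from the driver.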

Answer 2

@codeaperature, this is how I achieved batch processing, inspired by your answer. I am posting it as an answer because it exceeds the word limit for comments.

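Use foreach on the DataFrame's writeStream and pass in a ForeachWriter.
Initialize the rule session in the open method of the ForeachWriter.
Add each input JSON to the rule session in the process method.
Execute the rules in the close method, using the rule session loaded with the batch of data.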

//Scala code:

import org.apache.spark.sql.{ForeachWriter, Row}
import org.kie.api.runtime.KieSession

val dataFrame = sparkSession.readStream
  .format("kinesis")
  .option("streamName", streamName)
  .option("region", region)
  .option("endpointUrl", endpointUrl)
  .option("initialPosition", "TRIM_HORIZON")
  .load()

// Defined before the query that uses it.
val dataConsumer = new ForeachWriter[Row] {

  var ruleSession: KieSession = null

  // open is called on the executor, once per partition for each micro-batch (epoch)
  def open(partitionId: Long, version: Long): Boolean = {
    ruleSession = kBase.newKieSession()
    true
  }

  // process is called once for every record in the partition
  def process(row: Row): Unit = {
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // add each input JSON to the rule session
  }

  // close is called after process has been called for all records in the partition
  def close(errorOrNull: Throwable): Unit = {
    val factCount = ruleSession.getFactCount
    if (factCount > 0) {
      ruleHandler.processRule(ruleSession) // batch processing of the rules
    }
  }
}

val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreach(dataConsumer)
  .start()
