Apache Flink: transforming Broadcast variables fails, but I can't determine why
I am trying to put together a small example application on Apache Flink, mainly to demonstrate how broadcast variables are used. The application reads a CSV file and prepares a DataSet[BuildingInformation]:
```scala
case class BuildingInformation(
  buildingID: Int, buildingManager: String, buildingAge: Int,
  productID: String, country: String
)
```
This is how I am currently creating the BuildingInformation DataSet:
```scala
val buildingsBroadcastSet =
  envDefault
    .fromElements(
      readBuildingInfo(
        envDefault,
        "./SensorFiles/building.csv")
    )
```
Then, I begin the transformation:
```scala
val hvacStream = readHVACReadings(envDefault, "./SensorFiles/HVAC.csv")

hvacStream
  .map(new HVACToBuildingMapper)
  .withBroadcastSet(buildingsBroadcastSet, "buildingData")
  .writeAsCsv("./hvacTemp.csv")
```
A Map of (buildingID -> BuildingInformation) is the reference data that I want to broadcast. To prepare it, I implemented a RichMapFunction:
```scala
class HVACToBuildingMapper
  extends RichMapFunction[HVACData, EnhancedHVACTempReading] {

  var allBuildingDetails: Map[Int, BuildingInformation] = _

  override def open(configuration: Configuration): Unit = {
    allBuildingDetails =
      getRuntimeContext
        .getBroadcastVariableWithInitializer(
          "buildingData",
          new BroadcastVariableInitializer[BuildingInformation, Map[Int, BuildingInformation]] {
            def initializeBroadcastVariable(valuesPushed: java.lang.Iterable[BuildingInformation]): Map[Int, BuildingInformation] = {
              valuesPushed
                .asScala
                .toList
                .map(nextBuilding => (nextBuilding.buildingID, nextBuilding))(breakOut)
            }
          }
        )
  }

  override def map(nextReading: HVACData): EnhancedHVACTempReading = {
    val buildingDetails = allBuildingDetails.getOrElse(nextReading.buildingID, UndefinedBuildingInformation)
    // ... more intermediate data creation logic here
    EnhancedHVACTempReading(
      nextReading.buildingID,
      rangeOfTempRecorded,
      isExtremeTempRecorded,
      buildingDetails.country,
      buildingDetails.productID,
      buildingDetails.buildingAge,
      buildingDetails.buildingManager
    )
  }
}
```
In the function signature

```scala
def initializeBroadcastVariable(valuesPushed: java.lang.Iterable[BuildingInformation]): Map[Int, BuildingInformation]
```

the `java.lang.Iterable` qualification is my addition. Without it, the compiler complains in IntelliJ.
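In isolation, the `asScala`/`breakOut` pattern used in `open()` is sound; here is a minimal sketch that exercises it outside Flink, with a placeholder `Building` case class standing in for `BuildingInformation` (Scala 2.11/2.12, where `scala.collection.breakOut` still exists):

```scala
import scala.collection.JavaConverters._
import scala.collection.breakOut

// Placeholder stand-in for BuildingInformation (illustrative fields only).
case class Building(buildingID: Int, buildingManager: String)

// Simulate what the Flink runtime hands to initializeBroadcastVariable:
// a java.lang.Iterable over the broadcast elements.
val valuesPushed: java.lang.Iterable[Building] =
  java.util.Arrays.asList(Building(1, "Alice"), Building(2, "Bob"))

// The same asScala / breakOut pattern as in open(): key each element by its ID.
val byId: Map[Int, Building] = valuesPushed.asScala
  .toList
  .map(b => (b.buildingID, b))(breakOut)

assert(byId(2).buildingManager == "Bob")
```

So the conversion itself is not the culprit; the failure has to do with what the broadcast set actually contains.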
At runtime, the application fails at the point where I create the map from the Iterable[BuildingInformation] that the framework passes to the open() function:
```
java.lang.Exception: The user defined 'open()' method caused an exception: scala.collection.immutable.$colon$colon cannot be cast to org.nirmalya.hortonworks.tutorial.BuildingInformation
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to org.nirmalya.hortonworks.tutorial.BuildingInformation
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper$$anon$7$$anonfun$initializeBroadcastVariable$1.apply(HVACReadingsAnalysis.scala:139)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper$$anon$7.initializeBroadcastVariable(HVACReadingsAnalysis.scala:139)
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper$$anon$7.initializeBroadcastVariable(HVACReadingsAnalysis.scala:133)
    at org.apache.flink.runtime.broadcast.BroadcastVariableMaterialization.getVariable(BroadcastVariableMaterialization.java:234)
    at org.apache.flink.runtime.operators.util.DistributedRuntimeUDFContext.getBroadcastVariableWithInitializer(DistributedRuntimeUDFContext.java:84)
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper.open(HVACReadingsAnalysis.scala:131)
    at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:38)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:471)
    ... 3 more

09:28:54,389 INFO  org.apache.flink.runtime.client.JobClientActor - 04/29/2016 09:28:54 Job execution switched to status FAILED.
```
On the assumption that this might be some special case of casting a case class out of a (Java) Iterable failing (though I didn't really believe that myself), I tried replacing BuildingInformation with a Tuple5 of all its member fields. The behavior did not change.
I could have tried providing a CanBuildFrom, but I didn't. My thinking was to reject the idea that a simple case class cannot be mapped into another data structure. Something is wrong here that is not obvious to me.
For completeness, I tried the Flink builds corresponding to Scala 2.11.x and Scala 2.10.x: the behavior is the same.
Also, here is EnhancedHVACTempReading (for a better understanding of the code):
```scala
case class EnhancedHVACTempReading(
  buildingID: Int, rangeOfTemp: String, extremeIndicator: Boolean,
  country: String, productID: String, buildingAge: Int, buildingManager: String
)
```
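One detail the post does not show is `UndefinedBuildingInformation`, the fallback used in `getOrElse` inside `map()`. A plausible definition (hypothetical, not taken from the original code) is simply a sentinel instance of the same case class, so that readings with an unknown buildingID still produce a row:

```scala
case class BuildingInformation(
  buildingID: Int, buildingManager: String, buildingAge: Int,
  productID: String, country: String
)

// Hypothetical sentinel for readings whose buildingID has no match in the
// broadcast map; the field values are illustrative placeholders.
val UndefinedBuildingInformation =
  BuildingInformation(-1, "UNKNOWN", -1, "UNKNOWN", "UNKNOWN")

// getOrElse then degrades gracefully for unknown IDs:
val lookup = Map(1 -> BuildingInformation(1, "Alice", 12, "P-100", "SE"))
assert(lookup.getOrElse(99, UndefinedBuildingInformation).country == "UNKNOWN")
```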
I have a hunch that the JVM's discomfort has something to do with a Java Iterable being used as a Scala List, but of course I am not sure.

Could someone help me spot the mistake?
The problem is that you have to return something from the `map` function in `readBuildingInfo`. Furthermore, since you provide a `List[BuildingInformation]`, you must not use `fromElements` but `fromCollection` if you want to flatten the list into individual elements. The following code snippets show the necessary changes.
```scala
def main(args: Array[String]): Unit = {
  val envDefault = ExecutionEnvironment.getExecutionEnvironment

  val buildingsBroadcastSet = readBuildingInfo(envDefault, "./SensorFiles/building.csv")

  val hvacStream = readHVACReadings(envDefault, "./SensorFiles/HVAC.csv")

  hvacStream
    .map(new HVACToBuildingMapper)
    .withBroadcastSet(buildingsBroadcastSet, "buildingData")
    .writeAsCsv("./hvacTemp.csv")

  envDefault.execute("HVAC Simulation")
}
```
and
```scala
private def readBuildingInfo(env: ExecutionEnvironment, inputPath: String): DataSet[BuildingInformation] = {
  val input = Source.fromFile(inputPath).getLines.drop(1).map(datum => {
    val fields = datum.split(",")
    BuildingInformation(
      fields(0).toInt, // buildingID
      fields(1),       // buildingManager
      fields(2).toInt, // buildingAge
      fields(3),       // productID
      fields(4)        // country
    )
  })
  env.fromCollection(input.toList)
}
```