
Apache Flink: transforming Broadcast variables fails, but I can't determine why

I am trying to prepare a small sample application on Apache Flink, the main intention being to demonstrate how to use Broadcast variables. The application reads a CSV file and prepares a DataSet[BuildingInformation]:

case class BuildingInformation(
     buildingID: Int, buildingManager: String, buildingAge: Int,
     productID: String, country: String
)

This is how I am creating the BuildingInformation DataSet at the moment:

val buildingsBroadcastSet =
      envDefault
      .fromElements(
                     readBuildingInfo(
                        envDefault,
                        "./SensorFiles/building.csv")
                   )

And later, I begin the transformation thus:

val hvacStream = readHVACReadings(envDefault,"./SensorFiles/HVAC.csv")

    hvacStream
      .map(new HVACToBuildingMapper)
      .withBroadcastSet(buildingsBroadcastSet,"buildingData")
      .writeAsCsv("./hvacTemp.csv")

What I want broadcast as reference data is a map of (buildingID -> BuildingInformation). To prepare it, I have implemented a RichMapFunction:

 class HVACToBuildingMapper
    extends RichMapFunction[HVACData, EnhancedHVACTempReading] {

    var allBuildingDetails: Map[Int, BuildingInformation] = _

    override def open(configuration: Configuration): Unit = {
      allBuildingDetails =
        getRuntimeContext
          .getBroadcastVariableWithInitializer(
            "buildingData",
            new BroadcastVariableInitializer[BuildingInformation, Map[Int, BuildingInformation]] {

              def initializeBroadcastVariable(valuesPushed: java.lang.Iterable[BuildingInformation]): Map[Int, BuildingInformation] = {
                valuesPushed
                  .asScala
                  .toList
                  .map(nextBuilding => (nextBuilding.buildingID, nextBuilding))(breakOut)
              }
            }
          )
    }
    override def map(nextReading: HVACData): EnhancedHVACTempReading = {
      val buildingDetails = allBuildingDetails.getOrElse(nextReading.buildingID,UndefinedBuildingInformation)
      // ... more intermediate data creation logic here
      EnhancedHVACTempReading(
        nextReading.buildingID,
        rangeOfTempRecorded,
        isExtremeTempRecorded,
        buildingDetails.country,
        buildingDetails.productID,
        buildingDetails.buildingAge,
        buildingDetails.buildingManager
      )
    }

  }
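
For reference, UndefinedBuildingInformation is the sentinel returned when a reading's buildingID has no match in the broadcast map. Its definition is not reproduced here; a minimal sketch would be a BuildingInformation filled with placeholder values:

// Hypothetical sentinel (the original definition is not shown in the post):
// placeholder values marking a reading whose buildingID matched no building.
val UndefinedBuildingInformation = BuildingInformation(-1, "UNKNOWN", -1, "UNKNOWN", "UNKNOWN")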

In the function signature

def initializeBroadcastVariable(valuesPushed: java.lang.Iterable[BuildingInformation]): Map[Int, BuildingInformation]

the qualification with java.lang.Iterable is my addition. Without it, the compiler in IntelliJ complains.
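
That complaint is expected: BroadcastVariableInitializer is a Java interface, so initializeBroadcastVariable receives a java.lang.Iterable rather than a scala.collection.Iterable. The asScala and breakOut calls above bridge back into Scala collections; a minimal sketch of that bridge in isolation (toBuildingMap is an illustrative name, not from the original code):

import scala.collection.JavaConverters._  // provides .asScala on java.lang.Iterable
import scala.collection.breakOut          // lets map build a Map directly

// Convert the Java Iterable that Flink hands to the initializer
// into an immutable Scala Map keyed by buildingID.
def toBuildingMap(values: java.lang.Iterable[BuildingInformation]): Map[Int, BuildingInformation] =
  values.asScala.map(b => b.buildingID -> b)(breakOut)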

At runtime, the application fails at the point where I create a map out of the Iterable[BuildingInformation] that the framework passes to the open() function:

java.lang.Exception: The user defined 'open()' method caused an exception: scala.collection.immutable.$colon$colon cannot be cast to org.nirmalya.hortonworks.tutorial.BuildingInformation
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to org.nirmalya.hortonworks.tutorial.BuildingInformation
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper$$anon$7$$anonfun$initializeBroadcastVariable$1.apply(HVACReadingsAnalysis.scala:139)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper$$anon$7.initializeBroadcastVariable(HVACReadingsAnalysis.scala:139)
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper$$anon$7.initializeBroadcastVariable(HVACReadingsAnalysis.scala:133)
    at org.apache.flink.runtime.broadcast.BroadcastVariableMaterialization.getVariable(BroadcastVariableMaterialization.java:234)
    at org.apache.flink.runtime.operators.util.DistributedRuntimeUDFContext.getBroadcastVariableWithInitializer(DistributedRuntimeUDFContext.java:84)
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper.open(HVACReadingsAnalysis.scala:131)
    at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:38)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:471)
    ... 3 more
09:28:54,389 INFO  org.apache.flink.runtime.client.JobClientActor                - 04/29/2016 09:28:54  Job execution switched to status FAILED.

Suspecting that this was perhaps a particular failure to transform a case class out of a (Java) Iterable (though I was not convinced myself), I tried replacing BuildingInformation with a Tuple5 of all its member fields. The behaviour didn't change.

I could have tried providing a CanBuildFrom, but I stopped short of it. I could not accept that a simple case class cannot be mapped into another data structure; something else must be wrong that is not obvious to me.

Just to complete the post: I have tried the Flink builds for both Scala 2.11.x and Scala 2.10.x, and the behaviour was the same.

Also, here is EnhancedHVACTempReading (for better comprehension of the code):

case class EnhancedHVACTempReading(
     buildingID: Int, rangeOfTemp: String, extremeIndicator: Boolean,
     country: String, productID: String, buildingAge: Int, buildingManager: String
)
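
HVACData itself is not reproduced here; judging from the fields the mapper uses, it carries at least the buildingID together with the temperature readings from HVAC.csv. A rough, hypothetical sketch (the field names are illustrative, not the original definition):

case class HVACData(
     targetTemp: Int, actualTemp: Int, buildingID: Int
)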

I have a hunch that the JVM's complaint has something to do with a Java Iterable being used as a Scala List, but of course I am not sure.

Could someone help me to spot the mistake?

The problem is that you have to return something from the map function in readBuildingInfo. Furthermore, you shouldn't use fromElements when you have a List[BuildingInformation]; use fromCollection instead if you want to flatten the list into individual elements. fromElements(list) produces a DataSet with a single element, the list itself, so the broadcast variable hands your initializer a scala.collection.immutable.List (its cons cell is the $colon$colon in your stack trace) where a BuildingInformation is expected; hence the ClassCastException. The following code snippets show the necessary changes.

def main(args: Array[String]): Unit = {
    val envDefault = ExecutionEnvironment.getExecutionEnvironment

    val buildingsBroadcastSet = readBuildingInfo(envDefault,"./SensorFiles/building.csv")

    val hvacStream = readHVACReadings(envDefault,"./SensorFiles/HVAC.csv")

    hvacStream
      .map(new HVACToBuildingMapper)
      .withBroadcastSet(buildingsBroadcastSet,"buildingData")
      .writeAsCsv("./hvacTemp.csv")

    envDefault.execute("HVAC Simulation")
}

And

private def readBuildingInfo(env: ExecutionEnvironment, inputPath: String): DataSet[BuildingInformation] = {
    // Requires: import scala.io.Source
    val input = Source.fromFile(inputPath).getLines.drop(1).map(datum => {
      val fields = datum.split(",")
      BuildingInformation(
        fields(0).toInt,     // buildingID
        fields(1),           // buildingManager
        fields(2).toInt,     // buildingAge
        fields(3),           // productID
        fields(4)            // country
      )
    })
    env.fromCollection(input.toList)
}
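
The distinction matters because of how the two methods type the resulting DataSet. A minimal sketch of the difference (env is an ExecutionEnvironment; buildings is an illustrative placeholder):

val buildings: List[BuildingInformation] = ???

// fromElements treats each argument as one element: the whole list
// becomes a single record of type List[BuildingInformation].
val wrong: DataSet[List[BuildingInformation]] = env.fromElements(buildings)

// fromCollection flattens the collection into one record per element,
// which is what the broadcast variable's consumers expect to iterate over.
val right: DataSet[BuildingInformation] = env.fromCollection(buildings)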
