Error in returning spark rdd from a function called inside the map function

Question

I have a collection of rowkeys (plants as shown below) from hbase table and I want to make a fetchData function which returns rdd data for rowkeys from the collection. Goal is to get union of RDDs from fetchData method for each element from plants collection. I have given the relevant part of code below. My issue is, the code is giving compilation error for return type of fetchData:

println("PartB: "+ hBaseRDD.getNumPartitions)

error: value getNumPartitions is not a member of Option[org.apache.spark.rdd.RDD[it.nerdammer.spark.test.sys.Record]]

I am using scala 2.11.8 spark 2.2.0 and maven compilation

import it.nerdammer.spark.hbase._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object sys {
  case class systems( rowkey: String, iacp: Option[String], temp: Option[String])

  val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores",4).getOrCreate()
  import spark.implicits._

  type Record = (String, Option[String], Option[String])

  def fetchData(plant: String): RDD[Record] = {
    val start_index = plant
    val end_index = plant + "z"
    //The below command works fine if I run it in main function, but to get multiple rows from hbase, I am using it in a separate function
    spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)

  }

  def main(args: Array[String]) {
    //the below elements in the collection are prefix of relevant rowkeys in hbase table ("test_table") 
    val plants = Vector("a8","cu","aw","fx")
    val hBaseRDD = plants.map( pp => fetchData(pp))
    println("Part: "+ hBaseRDD.getNumPartitions)
    /*
      rest of the code
    */
  }

}

Here is the working version of code. The problem here is that I am using for loop and I have to request data corresponding to rowkey (plants) vector from HBase in each loop instead of getting all the data first and then executing rest of the codes

    import it.nerdammer.spark.hbase._
    import org.apache.spark.sql._
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
    import org.apache.log4j.Level
    import org.apache.log4j.Logger
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    object sys {
      case class systems( rowkey: String, iacp: Option[String], temp: Option[String])
      def main(args: Array[String]) {
        
        val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores",4).getOrCreate()
        import spark.implicits._

        type Record = (String, Option[String], Option[String])
        val plants = Vector("a8","cu","aw","fx")
        
        for (plant <- plants){
          val start_index = plant
          val end_index = plant + "z"
          val hBaseRDD = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
          println("Part: "+ hBaseRDD.getNumPartitions)
          /*
            rest of the code
          */
        }
      }
    }

After trying, this is where I am stuck now. So how can I cast the type to required.

scala>   def fetchData(plant: String) = {
     |     val start_index = plant
     |     val end_index = plant + "~"
     |     val x1 = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
     |     x1
     |   }

Define the function in REPL and running it

scala> val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)
<console>:39: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[(String, Option[String], Option[String])]
 required: it.nerdammer.spark.hbase.HBaseReaderBuilder[(String, Option[String], Option[String])]
       val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)

Thanks in Advance!

Answer 1

The type of hBaseRDD is Vector[_] and not RDD[_] , so you can't execute method getNumPartitions on it. If I understand correctly you want to union fetched RDDs. You can do it by plants.map( pp => fetchData(pp)).reduceOption(_ union _) (I recommend to use reduceOption because it won't fail on empty list, but you can use reduce if you confident that list is not empty)

Also the returned type of fetchData is RDD[U] , but I didn't find any definition of U . Probably this is the reason why compiler infer Vector[Nothing] instead of Vector[RDD[Record]] . In order to avoid subsequent errors you should also change RDD[U] to RDD[Record] .

Error in returning spark rdd from a function called inside the map function

Question

1 answers

solution1
3 2019-05-04 11:47:41

Error in returning spark rdd from a function called inside the map function

Question

1 answers

solution1 3 2019-05-04 11:47:41

solution1
3 2019-05-04 11:47:41