Error in returning spark rdd from a function called inside the map function

I have a collection of rowkeys (plants, as shown below) from an HBase table, and I want to write a fetchData function that returns the RDD data for a rowkey prefix from that collection. The goal is to take the union of the RDDs returned by fetchData for every element of the plants collection. I have given the relevant part of the code below. My issue is that the code gives a compilation error for the return type of fetchData:

println("PartB: "+ hBaseRDD.getNumPartitions) println("PartB:"+ hBaseRDD.getNumPartitions)

error: value getNumPartitions is not a member of Option[org.apache.spark.rdd.RDD[it.nerdammer.spark.test.sys.Record]]

I am using Scala 2.11.8, Spark 2.2.0, and Maven to compile.

import it.nerdammer.spark.hbase._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object sys {
  case class systems( rowkey: String, iacp: Option[String], temp: Option[String])

  val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores",4).getOrCreate()
  import spark.implicits._

  type Record = (String, Option[String], Option[String])

  def fetchData(plant: String): RDD[Record] = {
    val start_index = plant
    val end_index = plant + "z"
    //The below command works fine if I run it in main function, but to get multiple rows from hbase, I am using it in a separate function
    spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)

  }

  def main(args: Array[String]) {
    //the below elements in the collection are prefix of relevant rowkeys in hbase table ("test_table") 
    val plants = Vector("a8","cu","aw","fx")
    val hBaseRDD = plants.map( pp => fetchData(pp))
    println("Part: "+ hBaseRDD.getNumPartitions)
    /*
      rest of the code
    */
  }

}

Here is a working version of the code. The problem with it is that I am using a for loop, so in each iteration I have to request the data for one rowkey prefix (plant) from HBase, instead of fetching all the data first and then running the rest of the code.

    import it.nerdammer.spark.hbase._
    import org.apache.spark.sql._
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
    import org.apache.log4j.Level
    import org.apache.log4j.Logger
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    object sys {
      case class systems( rowkey: String, iacp: Option[String], temp: Option[String])
      def main(args: Array[String]) {
        
        val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores",4).getOrCreate()
        import spark.implicits._

        type Record = (String, Option[String], Option[String])
        val plants = Vector("a8","cu","aw","fx")
        
        for (plant <- plants){
          val start_index = plant
          val end_index = plant + "z"
          val hBaseRDD = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
          println("Part: "+ hBaseRDD.getNumPartitions)
          /*
            rest of the code
          */
        }
      }
    }

After trying further, this is where I am stuck now. How can I cast the value to the required type?

scala>   def fetchData(plant: String) = {
     |     val start_index = plant
     |     val end_index = plant + "~"
     |     val x1 = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
     |     x1
     |   }
    

Defining the function in the REPL and running it:

scala> val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)
<console>:39: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[(String, Option[String], Option[String])]
 required: it.nerdammer.spark.hbase.HBaseReaderBuilder[(String, Option[String], Option[String])]
       val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)

Thanks in advance!

The type of hBaseRDD is Vector[_], not RDD[_], so you can't call the getNumPartitions method on it. If I understand correctly, you want to union the fetched RDDs. You can do that with plants.map( pp => fetchData(pp)).reduceOption(_ union _) (I recommend reduceOption because it won't fail on an empty list, but you can use reduce if you are confident the list is not empty), as sketched below.
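To illustrate that pattern in isolation, here is a minimal, self-contained sketch of reducing a collection of RDDs with reduceOption(_ union _). It uses parallelize as a stand-in for the HBase reads; the object name and the stand-in fetchData are placeholders, not part of the original code.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    object UnionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("unionSketch").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val plants = Vector("a8", "cu", "aw", "fx")

        // Stand-in for fetchData: each prefix yields a small RDD of (rowkey, value) pairs.
        def fetchData(plant: String): RDD[(String, Int)] =
          sc.parallelize(Seq((plant + "1", 1), (plant + "2", 2)))

        // Vector[RDD[...]] reduced to a single RDD; the result is None only if `plants` is empty.
        val combined: Option[RDD[(String, Int)]] =
          plants.map(fetchData).reduceOption(_ union _)

        // Unwrap the Option before calling RDD methods such as getNumPartitions.
        combined.foreach(rdd => println("Partitions: " + rdd.getNumPartitions))
        spark.stop()
      }
    }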

Also, the declared return type of fetchData is RDD[U], but I didn't find any definition of U. That is probably why the compiler infers Vector[Nothing] instead of Vector[RDD[Record]]. To avoid subsequent errors you should also change RDD[U] to RDD[Record].
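Concretely, here is a sketch of what the REPL snippet might look like with the return type spelled out. It assumes, as the first snippet in the question does, that the implicits brought in by import it.nerdammer.spark.hbase._ convert the HBaseReaderBuilder[Record] into an RDD[Record] when that is the expected type.

    import org.apache.spark.rdd.RDD

    // Annotating the return type forces the builder-to-RDD conversion here
    // (assuming the connector's implicits are in scope), so that `_ union _`
    // below operates on RDDs rather than on HBaseReaderBuilders.
    def fetchData(plant: String): RDD[Record] = {
      val start_index = plant
      val end_index = plant + "~"
      spark.sparkContext.hbaseTable[Record]("test_table")
        .select("iacp", "temp")
        .inColumnFamily("pp")
        .withStartRow(start_index)
        .withStopRow(end_index)
    }

    // Vector[RDD[Record]] reduced to Option[RDD[Record]]; unwrap it before use.
    val hBaseRDD = plants.map(pp => fetchData(pp)).reduceOption(_ union _)
    hBaseRDD.foreach(rdd => println("Part: " + rdd.getNumPartitions))

Note that reduceOption returns an Option[RDD[Record]], which is why the earlier println failed with "value getNumPartitions is not a member of Option[...]"; calling it inside foreach (or unwrapping with a pattern match) avoids that.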
