spark将spark-SQL转换为RDD API

Question

Spark SQL is pretty clear to me. Spark SQL对我来说很清楚。 However, I am just getting started with spark's RDD API. 但是，我刚刚开始使用Spark的RDD API。 As spark apply function to columns in parallel points out this should allow me to get rid of slow shuffles for 由于将火花应用于并行列的功能指出，这应该让我摆脱缓慢的洗牌

def handleBias(df: DataFrame, colName: String, target: String = this.target) = {
    val w1 = Window.partitionBy(colName)
    val w2 = Window.partitionBy(colName, target)

    df.withColumn("cnt_group", count("*").over(w2))
      .withColumn("pre2_" + colName, mean(target).over(w1))
      .withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
      .drop("cnt_group")
  }
}

In pseudo code: df foreach column (handleBias(column) So a minimal data frame is loaded up 用伪代码： df foreach column (handleBias(column)这样就加载了一个最小的数据帧

val input = Seq(
    (0, "A", "B", "C", "D"),
    (1, "A", "B", "C", "D"),
    (0, "d", "a", "jkl", "d"),
    (0, "d", "g", "C", "D"),
    (1, "A", "d", "t", "k"),
    (1, "d", "c", "C", "D"),
    (1, "c", "B", "C", "D")
  )
  val inputDf = input.toDF("TARGET", "col1", "col2", "col3TooMany", "col4")

but fails to map correctly 但无法正确映射

val rdd1_inputDf = inputDf.rdd.flatMap { x => {(0 until x.size).map(idx => (idx, x(idx)))}}
      rdd1_inputDf.toDF.show

It fails with 它失败了

java.lang.ClassNotFoundException: scala.Any
java.lang.ClassNotFoundException: scala.Any

An example can be found https://github.com/geoHeil/sparkContrastCoding respectively https://github.com/geoHeil/sparkContrastCoding/blob/master/src/main/scala/ColumnParallel.scala for the problem outlined in this question. 对于此问题中概述的问题，可以分别在https://github.com/geoHeil/sparkContrastCoding https://github.com/geoHeil/sparkContrastCoding/blob/master/src/main/scala/ColumnParallel.scala中找到一个示例。

Answer 1

When you call .rdd on a DataFrame you get an RDD[Row] which is not strongly typed. 当你调用.rdd在DataFrame你得到一个RDD[Row]这不是强类型。 If you want to be able to map over the elements you will need to pattern match over Row : 如果您希望能够映射到元素，则需要在Row进行模式匹配：

scala> val input = Seq(
     |     (0, "A", "B", "C", "D"),
     |     (1, "A", "B", "C", "D"),
     |     (0, "d", "a", "jkl", "d"),
     |     (0, "d", "g", "C", "D"),
     |     (1, "A", "d", "t", "k"),
     |     (1, "d", "c", "C", "D"),
     |     (1, "c", "B", "C", "D")
     |   )
input: Seq[(Int, String, String, String, String)] = List((0,A,B,C,D), (1,A,B,C,D), (0,d,a,jkl,d), (0,d,g,C,D), (1,A,d,t,k), (1,d,c,C,D), (1,c,B,C,D))

scala> val inputDf = input.toDF("TARGET", "col1", "col2", "col3TooMany", "col4") 
inputDf: org.apache.spark.sql.DataFrame = [TARGET: int, col1: string ... 3 more fields]

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val rowRDD = inputDf.rdd
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at rdd at <console>:27

scala> val typedRDD = rowRDD.map{case Row(a: Int, b: String, c: String, d: String, e: String) => (a,b,c,d,e)}
typedRDD: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = MapPartitionsRDD[20] at map at <console>:29

scala> typedRDD.keyBy(_._1).groupByKey.foreach{println}
[Stage 7:>                                                          (0 + 0) / 4]
(0,CompactBuffer((A,B,C,D), (d,a,jkl,d), (d,g,C,D)))
(1,CompactBuffer((A,B,C,D), (A,d,t,k), (d,c,C,D), (c,B,C,D)))

Otherwise you can use a typed Dataset : 否则，您可以使用类型化的Dataset ：

scala> val ds = input.toDS
ds: org.apache.spark.sql.Dataset[(Int, String, String, String, String)] = [_1: int, _2: string ... 3 more fields]

scala> ds.rdd
res2: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = MapPartitionsRDD[8] at rdd at <console>:30

scala> ds.rdd.keyBy(_._1).groupByKey.foreach{println}
[Stage 0:>                                                          (0 + 0) / 4]
(0,CompactBuffer((0,A,B,C,D), (0,d,a,jkl,d), (0,d,g,C,D)))
(1,CompactBuffer((1,A,B,C,D), (1,A,d,t,k), (1,d,c,C,D), (1,c,B,C,D)))

spark将spark-SQL转换为RDD API

问题描述

1 个解决方案

解决方案1
3 已采纳 2017-01-02 23:30:33

spark将spark-SQL转换为RDD API

问题描述

1 个解决方案

解决方案1 3 已采纳 2017-01-02 23:30:33

解决方案1
3 已采纳 2017-01-02 23:30:33