Converting an RDD to a DataFrame in Spark / Scala

Question

The RDD has been created in the format Array[Array[String]] and has the following values:

val rdd : Array[Array[String]] = Array(
Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"), 
Array("4580056797", "0", "2015-07-29 10:38:43", "0", "1", "1"))

I want to create a DataFrame with the following schema:

val schemaString = "callId oCallId callTime duration calltype swId"

Next steps:

scala> val rowRDD = rdd.map(p => Array(p(0), p(1), p(2),p(3),p(4),p(5).trim))
rowRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[14] at map at <console>:39
scala> val calDF = sqlContext.createDataFrame(rowRDD, schema)

This gives the following error:

console:45: error: overloaded method value createDataFrame with alternatives:
     (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
    (rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
    cannot be applied to (org.apache.spark.rdd.RDD[Array[String]],   
    org.apache.spark.sql.types.StructType)
       val calDF = sqlContext.createDataFrame(rowRDD, schema)

Answer 1

Just paste the following into a spark-shell:

val a = 
  Array(
    Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"), 
    Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"))

val rdd = sc.makeRDD(a)

case class X(callId: String, oCallId: String, 
  callTime: String, duration: String, calltype: String, swId: String)

Then map() over the RDD to create instances of the case class, and create the DataFrame using toDF():

scala> val df = rdd.map { 
  case Array(s0, s1, s2, s3, s4, s5) => X(s0, s1, s2, s3, s4, s5) }.toDF()
df: org.apache.spark.sql.DataFrame = 
  [callId: string, oCallId: string, callTime: string, 
    duration: string, calltype: string, swId: string]

This infers the schema from the case class.

Then you can proceed with:

scala> df.printSchema()
root
 |-- callId: string (nullable = true)
 |-- oCallId: string (nullable = true)
 |-- callTime: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- calltype: string (nullable = true)
 |-- swId: string (nullable = true)

scala> df.show()
+----------+-------+-------------------+--------+--------+----+
|    callId|oCallId|           callTime|duration|calltype|swId|
+----------+-------+-------------------+--------+--------+----+
|4580056797|      0|2015-07-29 10:38:42|       0|       1|   1|
|4580056797|      0|2015-07-29 10:38:42|       0|       1|   1|
+----------+-------+-------------------+--------+--------+----+

If you want to use toDF() in a normal program (not in the spark-shell), make sure to do the following (quoted from here):

Import sqlContext.implicits._ right after creating the SQLContext.
Define the case class outside of the method that uses toDF().
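
The two points above can be sketched as a small standalone program. This is only a minimal illustration under stated assumptions: the names Call, RddToDataFrame and toCall, the application name and the local[*] master are all hypothetical and not part of the original answer, and the Spark 1.x SQLContext API is assumed, as in the rest of this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Defined at the top level, outside the method that calls toDF().
case class Call(callId: String, oCallId: String, callTime: String,
                duration: String, calltype: String, swId: String)

object RddToDataFrame {
  // Hypothetical helper: turns one six-field record into a Call.
  def toCall(fields: Array[String]): Call = fields match {
    case Array(s0, s1, s2, s3, s4, s5) => Call(s0, s1, s2, s3, s4, s5)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical app name and local master, for illustration only.
    val conf = new SparkConf().setAppName("RddToDataFrame").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Import the implicits right after creating the SQLContext; this enables toDF().
    import sqlContext.implicits._

    val rdd = sc.parallelize(Seq(
      Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
      Array("4580056797", "0", "2015-07-29 10:38:43", "0", "1", "1")))

    // The schema is inferred from the fields of the Call case class.
    val df = rdd.map(toCall).toDF()
    df.printSchema()

    sc.stop()
  }
}
```

If the case class were declared inside main, the compiler would typically fail to find the implicit TypeTag that toDF() needs, which is why it sits at the top level here.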

Answer 2

You first need to convert your Array into a Row and then define the schema. I assumed that most of your fields are Long.

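    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    val rdd: RDD[Array[String]] = ???
    // Each record has six fields; the numeric ones are converted to Long.
    val rows: RDD[Row] = rdd map {
      case Array(callId, oCallId, callTime, duration, calltype, swId) =>
        Row(callId.toLong, oCallId.toLong, callTime, duration.toLong, calltype.toLong, swId.toLong)
    }

    // The field types must match the values placed in each Row.
    object schema {
      val callId = StructField("callId", LongType)
      val oCallId = StructField("oCallId", LongType)
      val callTime = StructField("callTime", StringType)
      val duration = StructField("duration", LongType)
      val calltype = StructField("calltype", LongType)
      val swId = StructField("swId", LongType)

      val struct = StructType(Array(callId, oCallId, callTime, duration, calltype, swId))
    }

    sqlContext.createDataFrame(rows, schema.struct)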

Answer 3

I assume that your schema is defined as in the Spark guide, as follows:

val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

If you look at the signatures of createDataFrame, here is the one that accepts a StructType as its second argument (for Scala):

def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame

Creates a DataFrame from an RDD containing Rows using the given schema.

So it accepts an RDD[Row] as its first argument. What you have in rowRDD is an RDD[Array[String]], so there is a mismatch.

Do you need an RDD[Array[String]]?

If not, you can use the following to create your DataFrame:

val rowRDD = rdd.map(p => Row(p(0), p(1), p(2),p(3),p(4),p(5).trim))
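
Putting the pieces of this answer together, the whole flow might look like the sketch below. It assumes a Spark 1.x spark-shell, where sc and sqlContext already exist, and it assumes the data is parallelized from the array in the question; the name data is only illustrative.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: run in a Spark 1.x spark-shell, where sc and sqlContext already exist.
val data = Array(
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
  Array("4580056797", "0", "2015-07-29 10:38:43", "0", "1", "1"))
val rdd = sc.parallelize(data)

// One nullable string column per name in the schema string.
val schemaString = "callId oCallId callTime duration calltype swId"
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)))

// Build Rows instead of Arrays so that the element type is RDD[Row].
val rowRDD = rdd.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5).trim))

// This now matches the (RDD[Row], StructType) overload.
val calDF = sqlContext.createDataFrame(rowRDD, schema)
calDF.show()
```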

Answer 4

Using Spark 1.6.1 and Scala 2.10

I got the same error (error: overloaded method value createDataFrame with alternatives:).

For me, the gotcha was the signature of createDataFrame. I was trying to use val rdd : List[Row], but it failed because java.util.List[org.apache.spark.sql.Row] and scala.collection.immutable.List[org.apache.spark.sql.Row] are NOT the same.

The working solution I found is to convert val rdd : Array[Array[String]] into an RDD[Row] via List[Array[String]]. I find this is the closest to what's in the documentation.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType};
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val rdd_original : Array[Array[String]] = Array(
    Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"), 
    Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"))

val rdd : List[Array[String]] = rdd_original.toList

val schemaString = "callId oCallId callTime duration calltype swId"

// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD to Rows.
val rowRDD = rdd.map(p => Row(p: _*)) // using splat is easier
// val rowRDD = rdd.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5))) // this also works

val df = sqlContext.createDataFrame(sc.parallelize(rowRDD:List[Row]), schema)
df.show
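
As a possible alternative to parallelizing the list, the Scala List[Row] can be wrapped as a java.util.List[Row] so that it matches the overload mentioned above. This is a sketch only, assuming a Spark 1.6+ spark-shell where sqlContext already exists; the names scalaRows and javaRows are illustrative.

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: Spark 1.6+ spark-shell, where sqlContext already exists.
val schema = StructType(
  "callId oCallId callTime duration calltype swId"
    .split(" ").map(name => StructField(name, StringType, nullable = true)))

val scalaRows: List[Row] = List(
  Row("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
  Row("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"))

// asJava wraps the Scala list as a java.util.List[Row] without copying it.
val javaRows: java.util.List[Row] = scalaRows.asJava

// Matches the (java.util.List[Row], StructType) overload of createDataFrame.
val df = sqlContext.createDataFrame(javaRows, schema)
df.show()
```

For data that is already a distributed RDD, mapping to Row as shown earlier is likely the better fit; the java.util.List route only applies to small local collections.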

4 answers

Answer 1: 12, accepted, 2015-10-14 15:31:01

Answer 2: 4, 2015-10-14 15:02:38

Answer 3: 1, 2015-10-14 14:57:44

Answer 4: 1, 2019-04-24 06:11:10
