Convert RDD to Dataframe in Spark/Scala
The RDD has been created in the format Array[Array[String]] and has the following values:
val rdd : Array[Array[String]] = Array(
Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
Array("4580056797", "0", "2015-07-29 10:38:43", "0", "1", "1"))
I want to create a DataFrame with the schema:
val schemaString = "callId oCallId callTime duration calltype swId"
Next steps:
scala> val rowRDD = rdd.map(p => Array(p(0), p(1), p(2),p(3),p(4),p(5).trim))
rowRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[14] at map at <console>:39
scala> val calDF = sqlContext.createDataFrame(rowRDD, schema)
gives the following error:
console:45: error: overloaded method value createDataFrame with alternatives:
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[Array[String]],
org.apache.spark.sql.types.StructType)
val calDF = sqlContext.createDataFrame(rowRDD, schema)
Just paste the following into a spark-shell:
val a =
Array(
Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"))
val rdd = sc.makeRDD(a)
case class X(callId: String, oCallId: String,
callTime: String, duration: String, calltype: String, swId: String)
Then map() over the RDD to create instances of the case class, and create the DataFrame with toDF():
scala> val df = rdd.map {
case Array(s0, s1, s2, s3, s4, s5) => X(s0, s1, s2, s3, s4, s5) }.toDF()
df: org.apache.spark.sql.DataFrame =
[callId: string, oCallId: string, callTime: string,
duration: string, calltype: string, swId: string]
This infers the schema from the case class. Then you can continue with:
scala> df.printSchema()
root
|-- callId: string (nullable = true)
|-- oCallId: string (nullable = true)
|-- callTime: string (nullable = true)
|-- duration: string (nullable = true)
|-- calltype: string (nullable = true)
|-- swId: string (nullable = true)
scala> df.show()
+----------+-------+-------------------+--------+--------+----+
| callId|oCallId| callTime|duration|calltype|swId|
+----------+-------+-------------------+--------+--------+----+
|4580056797| 0|2015-07-29 10:38:42| 0| 1| 1|
|4580056797| 0|2015-07-29 10:38:42| 0| 1| 1|
+----------+-------+-------------------+--------+--------+----+
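From here the DataFrame can be queried like any other; for example (a small sketch using the Spark 1.x API, where the temp-table name calls is illustrative):

scala> df.registerTempTable("calls")
scala> sqlContext.sql("SELECT callId, count(*) AS cnt FROM calls GROUP BY callId").show()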
If you want to use toDF() in a standalone program (and not in the spark-shell), make sure (quoting from here) to:
- import sqlContext.implicits._ right after creating the SQLContext
- define the case class outside of the method that uses toDF()
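For example, a minimal standalone sketch (assuming Spark 1.x; the object name CallsToDF is illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// the case class lives at top level, outside the method that calls toDF()
case class X(callId: String, oCallId: String,
  callTime: String, duration: String, calltype: String, swId: String)

object CallsToDF {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CallsToDF"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // right after creating the SQLContext

    val rdd = sc.makeRDD(Array(
      Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1")))
    val df = rdd.map { case Array(s0, s1, s2, s3, s4, s5) => X(s0, s1, s2, s3, s4, s5) }.toDF()
    df.show()
  }
}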
You need to convert the Array to a Row first, and then define the schema. I am assuming that most of your fields are Long:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val rdd: RDD[Array[String]] = ???
// Match all six fields from the question (the original match left out calltype,
// which would make the pattern fail with a MatchError on six-element arrays);
// oCallId is kept as a String to agree with the StringType field below.
val rows: RDD[Row] = rdd map {
  case Array(callId, oCallId, callTime, duration, calltype, swId) =>
    Row(callId.toLong, oCallId, callTime, duration.toLong, calltype, swId.toLong)
}
object schema {
  val callId = StructField("callId", LongType)
  val oCallId = StructField("oCallId", StringType)
  val callTime = StructField("callTime", StringType)
  val duration = StructField("duration", LongType)
  val calltype = StructField("calltype", StringType)
  val swId = StructField("swId", LongType)
  val struct = StructType(Array(callId, oCallId, callTime, duration, calltype, swId))
}
sqlContext.createDataFrame(rows, schema.struct)
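In the spark-shell, the ??? can be supplied from the question's array, for example (a sketch; sc.parallelize turns the local array into the RDD[Array[String]] expected above):

val rdd: RDD[Array[String]] = sc.parallelize(Array(
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
  Array("4580056797", "0", "2015-07-29 10:38:43", "0", "1", "1")))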
I am assuming that your schema, as in the Spark Programming Guide, looks like the following:
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
If you look at the signatures of createDataFrame, here is the one that accepts a StructType as the second argument (in Scala):
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
Creates a DataFrame from an RDD containing Rows using the given schema.
So it accepts an RDD[Row] as its first argument. What you have in rowRDD is an RDD[Array[String]], hence the mismatch.
Do you need an RDD[Array[String]]? Otherwise you can use the following to create your DataFrame:
val rowRDD = rdd.map(p => Row(p(0), p(1), p(2),p(3),p(4),p(5).trim))
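With rowRDD now an RDD[Row], the original call matches the fourth overload and type-checks:

val calDF = sqlContext.createDataFrame(rowRDD, schema)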
Using spark 1.6.1 and scala 2.10 I got the same error (error: overloaded method value createDataFrame with alternatives:).

For me, the gotcha was the signature of createDataFrame: I was trying to use a val rdd : List[Row], but it failed because java.util.List[org.apache.spark.sql.Row] and scala.collection.immutable.List[org.apache.spark.sql.Row] are not the same.
The working solution I found was to convert the val rdd : Array[Array[String]] into an RDD[Row] via a List[Array[String]]. I find this is the closest to what is in the documentation:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType};
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd_original : Array[Array[String]] = Array(
Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"))
val rdd : List[Array[String]] = rdd_original.toList
val schemaString = "callId oCallId callTime duration calltype swId"
// Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD to Rows.
val rowRDD = rdd.map(p => Row(p: _*)) // using splat is easier
// val rowRDD = rdd.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5))) // this also works
val df = sqlContext.createDataFrame(sc.parallelize(rowRDD:List[Row]), schema)
df.show
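Equivalently, the intermediate List can be skipped by parallelizing the original array directly (a sketch using the same schema as above; df2 is an illustrative name):

val df2 = sqlContext.createDataFrame(sc.parallelize(rdd_original.map(p => Row(p: _*))), schema)
df2.show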