How to convert a DataFrame to a Dataset in Apache Spark in Scala?
I need to convert my DataFrame to a Dataset, and I used the following code:
val final_df = Dataframe.withColumn(
  "features",
  toVec4(
    // casting into Timestamp to parse the string, and then into Int
    $"time_stamp_0".cast(TimestampType).cast(IntegerType),
    $"count",
    $"sender_ip_1",
    $"receiver_ip_2"
  )
).withColumn("label", Dataframe("count")).select("features", "label")
final_df.show()
val trainingTest = final_df.randomSplit(Array(0.3, 0.7))
val TrainingDF = trainingTest(0)
val TestingDF = trainingTest(1)
TrainingDF.show()
TestingDF.show()
// let's create our linear regression
val lir = new LinearRegression()
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setMaxIter(100)
.setTol(1E-6)
case class df_ds(features:Vector, label:Integer)
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
val Training_ds = TrainingDF.as[df_ds]
My problem is that I get the following error:
Error:(96, 36) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val Training_ds = TrainingDF.as[df_ds]
It seems that the number of values in the DataFrame is different from the number of fields in my class. However, I am using case class df_ds(features: Vector, label: Integer) on my TrainingDF DataFrame, since it has a vector of features and an integer label. Here is the TrainingDF DataFrame:
+--------------------+-----+
| features|label|
+--------------------+-----+
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,19...| 19|
|[1.497325796E9,10...| 10|
+--------------------+-----+
Also, here is my original final_df DataFrame:
+------------+-----------+-------------+-----+
|time_stamp_0|sender_ip_1|receiver_ip_2|count|
+------------+-----------+-------------+-----+
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.2| 10.0.0.3| 19|
| 05:49:56| 10.0.0.3| 10.0.0.2| 10|
+------------+-----------+-------------+-----+
However, I get the mentioned error. Can anybody help me? Thanks in advance.
The error message you are reading is a pretty good pointer.
When you convert a DataFrame to a Dataset, you have to have a proper Encoder for whatever is stored in the DataFrame rows.
Encoders for primitive-like types (Ints, Strings, and so on) and case classes are provided by just importing the implicits for your SparkSession, as follows:
case class MyData(intField: Int, boolField: Boolean) // e.g.
val spark: SparkSession = ???
val df: DataFrame = ???
import spark.implicits._
val ds: Dataset[MyData] = df.as[MyData]
If that doesn't work, it is because the type you are trying to cast the DataFrame to isn't supported. In that case, you would have to write your own Encoder: you may find more information about it here and see an example (the Encoder for java.time.LocalDateTime) here.
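Applied to the question's code, one likely cause of this error is that the case class df_ds is defined inside a method body, so the compiler cannot derive an implicit Encoder for it even with the implicits imported. A minimal sketch of the fix, assuming Spark 2.x and that the features column holds an org.apache.spark.ml.linalg.Vector (the names DfDs, ConvertExample, and toDataset are illustrative, not from the original code):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Define the case class at top level (or in a companion object),
// NOT inside the method that calls .as[...] -- otherwise the
// compiler may fail to derive an implicit Encoder for it.
case class DfDs(features: Vector, label: Int)

object ConvertExample {
  // Hypothetical helper: converts a DataFrame with "features" and
  // "label" columns into a strongly typed Dataset[DfDs].
  def toDataset(spark: SparkSession, df: DataFrame): Dataset[DfDs] = {
    import spark.implicits._ // brings product encoders into scope
    df.as[DfDs]
  }
}
```

Note also that a plain Scala Int in the case class tends to work better with encoders than a boxed java.lang.Integer, and that on Spark 2.x the vector type must be the new org.apache.spark.ml.linalg.Vector, not the old org.apache.spark.mllib.linalg.Vector.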
Spark 1.6.0
case class MyCase(id: Int, name: String)
val encoder = org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[MyCase]
val dataframe = …
val dataset = dataframe.as(encoder)
Spark 2.0 or above
case class MyCase(id: Int, name: String)
val encoder = org.apache.spark.sql.Encoders.product[MyCase]
val dataframe = …
val dataset = dataframe.as(encoder)