How to convert a simple DataFrame to a Dataset with a case class in Spark Scala?

Question

I am trying to convert a simple DataFrame to a Dataset, following the example in the Spark documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html

case class Person(name: String, age: Int)    
import spark.implicits._

val path = "examples/src/main/resources/people.json"

val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()

But the following problem arises:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `age` from bigint to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: ....

Can anyone help me out?

Edit: I noticed that it works with Long instead of Int. Why is that?

Also:

val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => ("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()

augmentedDS.as[Person].show()

Prints:

+-----+---+
|   _1| _2|
+-----+---+
|var_1|  2|
|var_2|  3|
|var_3|  4|
+-----+---+

Exception in thread "main"
org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_1, _2];

Can anyone help me understand what is happening here?

Answer 1

If you change Int to Long (or BigInt) it works fine:

case class Person(name: String, age: Long)
import spark.implicits._

val path = "examples/src/main/resources/people.json"

val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()

Output:

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

EDIT: By default, spark.read.json parses integer numbers as Long (bigint), which is the safer choice. Converting a bigint column to Int could truncate values, so .as[Person] refuses to do it implicitly. You can change the column type afterwards with a cast or a UDF.
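
A minimal sketch of the cast approach, assuming Spark 2.2 or later (for reading JSON from a Dataset[String]) and a local master. PersonInt, CastAgeExample and load are hypothetical names, and the in-memory lines stand in for people.json. The narrowing from bigint to int is requested explicitly before calling .as:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical case class: Option[Int] keeps the missing age for Michael null-safe.
case class PersonInt(name: String, age: Option[Int])

object CastAgeExample {
  // Reads the same three records as people.json from in-memory JSON lines,
  // casts the inferred bigint column down to int, then converts to a typed Dataset.
  def load(spark: SparkSession): Dataset[PersonInt] = {
    import spark.implicits._
    val lines = Seq(
      """{"name":"Michael"}""",
      """{"name":"Andy", "age":30}""",
      """{"name":"Justin", "age":19}"""
    ).toDS()
    spark.read.json(lines)                        // age is inferred as bigint (Long)
      .withColumn("age", col("age").cast("int"))  // explicit, possibly lossy, downcast
      .as[PersonInt]
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch on a single machine
    val spark = SparkSession.builder().master("local[*]").appName("cast-age").getOrCreate()
    load(spark).show()
    spark.stop()
  }
}
```

The age field is declared as Option[Int] because one record has no age; a plain Int field would likely fail as soon as rows are deserialized into objects, for example with collect().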

EDIT2:

To answer your second question: .as[Person] matches columns by name, so you need to rename the columns to name and age before the conversion to Person will work:

val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => ("var_" + i.toString, (i + 1).toLong)).
  withColumnRenamed("_1", "name").
  withColumnRenamed("_2", "age")
augmentedDS.as[Person].show()

Outputs:

+-----+---+
| name|age|
+-----+---+
|var_1|  2|
|var_2|  3|
|var_3|  4|
+-----+---+
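
As a possible alternative, both columns can be renamed in one call with toDF. This is a sketch under the assumption of a local SparkSession; RenameColumnsExample and build are hypothetical names:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Long)

object RenameColumnsExample {
  // Builds the tuple Dataset from the question, renames its columns,
  // and converts it to Dataset[Person].
  def build(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    Seq(1, 2, 3).toDS()
      .map(i => ("var_" + i.toString, (i + 1).toLong)) // columns are named _1 and _2
      .toDF("name", "age")                              // rename both columns at once
      .as[Person]                                       // names and types now match Person
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch on a single machine
    val spark = SparkSession.builder().master("local[*]").appName("rename-columns").getOrCreate()
    build(spark).show()
    spark.stop()
  }
}
```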

Answer 2

This is how you create a Dataset from a case class:

case class Person(name: String, age: Long)

Keep the case class outside of the class that contains the code below:

val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => Person("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()

augmentedDS.as[Person].show()
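
Put together as a single file, the layout might look like the sketch below. PersonDatasetApp and build are hypothetical names and the local master is an assumption. The case class sits at the top level because Spark derives its encoder from the class definition, and that derivation tends to fail for classes declared inside a method or another class:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Declared at the top level of the file, not inside the object or method that uses it.
case class Person(name: String, age: Long)

object PersonDatasetApp {
  // Mapping straight to Person already yields a Dataset[Person],
  // so no renaming and no further .as[Person] call is needed.
  def build(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    Seq(1, 2, 3).toDS().map(i => Person("var_" + i.toString, (i + 1).toLong))
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch on a single machine
    val spark = SparkSession.builder().master("local[*]").appName("person-dataset").getOrCreate()
    build(spark).show()
    spark.stop()
  }
}
```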

Hope this helped.

2 answers

Answer 1: score 4, accepted, 2017-07-10 16:58:08

Answer 2: score 1, 2017-07-10 17:00:30
