How to convert a simple DataFrame to a Dataset in Spark Scala with a case class?

I am trying to convert a simple DataFrame to a Dataset, following the example in the Spark SQL programming guide: https://spark.apache.org/docs/latest/sql-programming-guide.html

case class Person(name: String, age: Int)    
import spark.implicits._

val path = "examples/src/main/resources/people.json"

val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()

But the following problem arises:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `age` from bigint to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: ....

Can anyone help me out?

Edit: I noticed that it works with Long instead of Int. Why is that?

Also:

val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => ("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()

augmentedDS.as[Person].show()

Prints:

+-----+---+
|   _1| _2|
+-----+---+
|var_1|  2|
|var_2|  3|
|var_3|  4|
+-----+---+

Exception in thread "main"
org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_1, _2];

Can anyone help me understand what is going on here?

If you change Int to Long (or BigInt), it works fine:

case class Person(name: String, age: Long)
import spark.implicits._

val path = "examples/src/main/resources/people.json"

val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()

Output:

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

EDIT: By default, spark.read.json parses integral numbers as Long (bigint), which is the safer choice because the JSON itself does not say how wide an integer is. as[Person] will not implicitly narrow a bigint column to Int, since that could truncate values, hence the error. You can change the column type afterwards with a cast or a UDF.

EDIT2:

To answer your second question: the columns must be named to match the fields of Person before the conversion will work:
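To make the casting suggestion concrete, here is a minimal sketch of two ways to keep `age: Int` in the case class. It assumes Spark 2.2 or later and a local session. The name `PersonInt` and the in-memory JSON lines are made up for this illustration (they stand in for people.json, minus the row without an age).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical case class for this sketch, with age kept as Int.
case class PersonInt(name: String, age: Int)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-int-sketch")
  .getOrCreate()
import spark.implicits._

// Assumed sample data: in-memory JSON lines instead of a file on disk.
val jsonLines = Seq(
  """{"name":"Andy", "age":30}""",
  """{"name":"Justin", "age":19}"""
).toDS()

// Schema inference reads integral JSON numbers as long (bigint).
val inferred = spark.read.json(jsonLines)
inferred.printSchema()

// Option 1: cast the column down to int explicitly, then convert.
val castDS = inferred
  .withColumn("age", col("age").cast("int"))
  .as[PersonInt]

// Option 2: supply the schema up front so no inference takes place.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))
val typedDS = spark.read.schema(schema).json(jsonLines).as[PersonInt]

castDS.show()
typedDS.show()
```

The explicit cast tells Spark that the narrowing is intended, which is why it is accepted where the implicit conversion is not. If the column can be null (as for Michael in people.json), a field typed `Option[Int]` is the usual way to model that, because a plain `Int` cannot hold null.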

val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS
  .map(i => ("var_" + i.toString, (i + 1).toLong))
  .withColumnRenamed("_1", "name")
  .withColumnRenamed("_2", "age")
augmentedDS.as[Person].show()

Outputs:

+-----+---+
| name|age|
+-----+---+
|var_1|  2|
|var_2|  3|
|var_3|  4|
+-----+---+
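As an alternative to chaining `withColumnRenamed`, `toDF` can rename all columns in one call. The sketch below assumes a local session and the `Person(name: String, age: Long)` definition from this answer. The variable names `tuples` and `people` are chosen here only for illustration.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rename-sketch")
  .getOrCreate()
import spark.implicits._

// A Dataset of tuples gets the positional column names _1 and _2.
val tuples = Seq(1, 2, 3).toDS().map(i => ("var_" + i.toString, (i + 1).toLong))

// toDF(names...) renames every column at once; as[Person] then matches
// columns to case class fields by name.
val people = tuples.toDF("name", "age").as[Person]
people.show()
```

Because `as[Person]` resolves fields by column name rather than by position, any approach that produces columns called `name` and `age` with compatible types should work.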

This is how you create a Dataset from a case class:

case class Person(name: String, age: Long) 

Keep the case class definition outside of the class that contains the code below:

val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => Person("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()

augmentedDS.as[Person].show()
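A small sketch of what this approach produces, assuming a local session. Mapping directly to `Person` already yields a `Dataset[Person]`, so the final `as[Person]` call should not be strictly necessary. The names `people` and `older` are illustrative only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Defined at the top level, not inside the method that uses it, so that
// Spark can derive an encoder for it in a compiled application.
case class Person(name: String, age: Long)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("case-class-sketch")
  .getOrCreate()
import spark.implicits._

// The type annotation makes the compiler confirm the result is typed.
val people: Dataset[Person] =
  Seq(1, 2, 3).toDS().map(i => Person("var_" + i.toString, (i + 1).toLong))

// Column names come from the case class fields, so no renaming is needed.
people.printSchema()

// Typed operations can use the fields directly.
val older = people.filter(_.age >= 3)
older.show()
```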

Hope this helps.
