
How to create a Dataset (not a DataFrame) without using a case class but using StructType?

How can I create a Dataset using StructType?

We can create a Dataset as follows:

case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()

Is there a way to create a Dataset without using a case class?

I'd like to create that Dataset not with a case class but with a StructType.

If you know how to create a DataFrame, you already know how to create a Dataset :)

DataFrame = Dataset[Row].

What does that mean? Try:

import org.apache.spark.sql._

val df: DataFrame = spark.createDataFrame(...) // with StructType
val ds: Dataset[Row] = df // no error, as DataFrame is only a type alias of Dataset[Row]
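
In fact, the equality is literal: in the Spark source, DataFrame is defined as a type alias in the org.apache.spark.sql package object:

// from the org.apache.spark.sql package object in Spark's source
type DataFrame = Dataset[Row]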

That's an interesting question, in the sense that I don't see a reason why one would want it.

How can I create Dataset using "StructType"

I'd then ask a very similar question...

Why would you want to "trade" a case class for a StructType? What would that give you that a case class could not?

The reason you use a case class is that it can offer you two things at once:

  1. Describe your schema quickly, nicely and type-safely

  2. Working with your data becomes type-safe

Regarding 1., as a Scala developer you will define business objects that describe your data. You have to do that anyway (unless you like tuples and _1 and such).

Regarding type safety (in both 1. and 2.): it comes into play when transforming your data, where the Scala compiler can catch places where you expect a String but have an Int. With StructType, that check happens only at runtime, not at compile time.
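
As a quick illustration of the difference (a minimal sketch assuming a spark-shell session where spark and its implicits are available; the misspelled field agee is deliberate):

import org.apache.spark.sql.functions.col
import spark.implicits._

case class Person(name: String, age: Int)
val people = Seq(Person("Max", 33)).toDS()

// With a case class, a typo in a field name is caught at compile time:
// people.map(_.agee)                 // does not compile

// With Row/StructType, the same typo compiles fine and only fails
// at runtime with an AnalysisException:
people.toDF().select(col("agee"))     // throws when Spark analyzes the plan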

With all that said, the answer to your question is "Yes".

You can create a Dataset using StructType.

scala> val personDS = Seq(("Max", 33), ("Adam", 32), ("Muller", 62)).toDS
personDS: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> personDS.show
+------+---+
|    _1| _2|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

You may be wondering why you don't see the column names. That's exactly what a case class gives you: not only the types, but also the names of the columns.

There's one trick you can use, however, to avoid dealing with case classes if you don't like them.

scala> val withNames = personDS.toDF("name", "age").as[(String, Int)]

scala> withNames.show
+------+---+
|  name|age|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

Here's how you can create the Dataset with a StructType:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import spark.implicits._ // needed for the .as[...] conversion below

// Define the schema explicitly instead of deriving it from a case class.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Each Row must match the schema positionally.
val data = Seq(
  Row("Max", 33),
  Row("Adam", 32),
  Row("Muller", 62)
)

val personDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

// personDF is an untyped DataFrame (Dataset[Row]); .as gives a typed Dataset.
val yourDS = personDF.as[(String, Int)]

yourDS.show()
+------+---+
|  name|age|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

yourDS is an org.apache.spark.sql.Dataset[(String, Int)].

The personDS in your question is of type org.apache.spark.sql.Dataset[Person], so this doesn't quite give the same result.
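
If you do want a Dataset[Person], you still need the case class, since it supplies the encoder; a minimal sketch reusing personDF from above (and assuming spark.implicits._ is in scope):

case class Person(name: String, age: Int)

// Works because the schema's column names and types line up
// with the case class fields.
val typedDS = personDF.as[Person]
// typedDS: org.apache.spark.sql.Dataset[Person]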

See this post for more info on how to create Datasets.
