
How to create a Dataset (not a DataFrame) without using a case class but using StructType?

How can I create a Dataset using StructType?

We can create a Dataset as follows:

case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()

Is there a way to create a Dataset without using a case class?

I'd like to create that Dataset not with a case class but with a StructType.

If you know how to create a DataFrame, you already know how to create a Dataset :)

DataFrame = Dataset[Row].

What does that mean? Try:

import org.apache.spark.sql._

val df: DataFrame = spark.createDataFrame(...) // with StructType
val ds: Dataset[Row] = df // no error, as DataFrame is only a type alias of Dataset[Row]
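
In fact, the equality is literal: in the Spark source, DataFrame is defined as a type alias in the org.apache.spark.sql package object:

// from the org.apache.spark.sql package object in Spark's source
type DataFrame = Dataset[Row]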

That's an interesting question, in the sense that I don't see a reason why one would want it.

How can I create Dataset using "StructType"

I'd then ask a very similar question...

Why would you want to "trade" a case class for a StructType? What would that give you that a case class could not?

The reason you use a case class is that it can offer you two things at once:

  1. Describe your schema quickly, nicely and type-safely

  2. Working with your data becomes type-safe

Regarding 1., as a Scala developer you will define business objects that describe your data. You have to do that anyway (unless you like tuples and _1 and such).

Regarding type safety (in both 1. and 2.): it comes into play when transforming your data, where the Scala compiler can catch places where you expect a String but have an Int. With StructType, that check happens only at runtime, not at compile time.
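
As a quick illustration of the difference (a minimal sketch assuming a spark-shell session where spark and its implicits are available; the misspelled field agee is deliberate):

import org.apache.spark.sql.functions.col
import spark.implicits._

case class Person(name: String, age: Int)
val people = Seq(Person("Max", 33)).toDS()

// With a case class, a typo in a field name is caught at compile time:
// people.map(_.agee)                 // does not compile

// With Row/StructType, the same typo compiles fine and only fails
// at runtime with an AnalysisException:
people.toDF().select(col("agee"))     // throws when Spark analyzes the plan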

With all that said, the answer to your question is "Yes".

You can create a Dataset using StructType.

scala> val personDS = Seq(("Max", 33), ("Adam", 32), ("Muller", 62)).toDS
personDS: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> personDS.show
+------+---+
|    _1| _2|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

You may be wondering why you don't see the column names. That's exactly what a case class gives you: not only the types, but also the names of the columns.

There's one trick you can use, however, to avoid dealing with case classes if you don't like them.

scala> val withNames = personDS.toDF("name", "age").as[(String, Int)]

scala> withNames.show
+------+---+
|  name|age|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

Here's how you can create the Dataset with a StructType:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import spark.implicits._ // needed for the .as[...] conversion below

// Define the schema explicitly instead of deriving it from a case class.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Each Row must match the schema positionally.
val data = Seq(
  Row("Max", 33),
  Row("Adam", 32),
  Row("Muller", 62)
)

val personDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

// personDF is an untyped DataFrame (Dataset[Row]); .as gives a typed Dataset.
val yourDS = personDF.as[(String, Int)]

yourDS.show()
+------+---+
|  name|age|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

yourDS is an org.apache.spark.sql.Dataset[(String, Int)].

The personDS in your question is of type org.apache.spark.sql.Dataset[Person], so this doesn't quite give the same result.
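
If you do want a Dataset[Person], you still need the case class, since it supplies the encoder; a minimal sketch reusing personDF from above (and assuming spark.implicits._ is in scope):

case class Person(name: String, age: Int)

// Works because the schema's column names and types line up
// with the case class fields.
val typedDS = personDF.as[Person]
// typedDS: org.apache.spark.sql.Dataset[Person]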

See this post for more info on how to create Datasets.
