
Spark: map a dataframe using the dataframe's schema

I have a dataframe, created from a JSON object. I can query this dataframe and write it to parquet.

Since I infer the schema, I don't necessarily know what's in the dataframe.

Is there a way to get the column names out, or to map over the dataframe using its own schema?

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
df.map(t => "Name: " + t(0)).collect().foreach(println)

// or by field name:
df.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
df.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)

I would like to do something like

df.map (_.getValuesMap[Any](ListAll())).collect().foreach(println)
// Map ("name" -> "Justin", "age" -> 19, "color" -> "red")

without knowing the actual amount or names of the columns.

Well, you can, but the result is rather useless:

val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")

def getValues(row: Row, names: Seq[String]) = names.map(
  name => name -> row.getAs[Any](name)
).toMap

val names = df.columns
df.rdd.map(getValues(_, names)).first

// scala.collection.immutable.Map[String,Any] = 
//   Map(name -> Justin, age -> 19, color -> red)

To get something actually useful, one would need a proper mapping between SQL types and Scala types. It is not hard in simple cases, but it is hard in general. For example, there is no built-in type which can be used to represent an arbitrary struct. This could be done with a little bit of meta-programming, but arguably it is not worth all the fuss.
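To illustrate what such a mapping might look like, here is a minimal sketch (my own, not from the answer; it only covers a few primitive types, and a complete version would need a case per DataType, including nested structs and arrays):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Dispatch on the field's Spark SQL DataType to extract a properly typed value
def typedValue(row: Row, field: StructField): Any = field.dataType match {
  case StringType  => row.getAs[String](field.name)
  case IntegerType => row.getAs[Int](field.name)
  case LongType    => row.getAs[Long](field.name)
  case DoubleType  => row.getAs[Double](field.name)
  // ... every remaining DataType (StructType, ArrayType, ...) needs its own case
  case _           => row.getAs[Any](field.name)
}

// Capture the schema locally so the closure does not drag the DataFrame along
val schema = df.schema
df.rdd.map(row => schema.fields.map(f => f.name -> typedValue(row, f)).toMap).first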

You could use an implicit Encoder and perform the map on the DataFrame itself:

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

implicit class DataFrameEnhancer(df: DataFrame) extends Serializable {
    // An Encoder built from the DataFrame's own schema, for maps that return a Row
    implicit val encoder = RowEncoder(df.schema)

    def mapNameAndAge(): Dataset[(String, Int)] = {
        import df.sparkSession.implicits._ // supplies the Encoder for the tuple
        df.map(row => row.getAs[String]("name") -> row.getAs[Int]("age"))
    }
}

And invoke it on your dataframe like so:

val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")
df.mapNameAndAge().first

That way, you don't have to convert your DataFrame into an RDD. (In some cases you don't want to load the entire DataFrame from disk, just some columns, but the RDD conversion forces you into doing that anyway.) Plus, you're using an Encoder instead of Kryo (or another Java SerDe), which is much faster.
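If you would rather keep every column instead of hard-coding name and age, a minimal sketch (the mapAllColumns helper is my own illustration, not part of the answer) can map Row => Row with the same RowEncoder, so the result is still a DataFrame:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// Row => Row preserves the schema, so df.map returns a Dataset[Row], i.e. a DataFrame
def mapAllColumns(df: DataFrame)(f: Row => Row): DataFrame = {
  implicit val encoder = RowEncoder(df.schema) // encoder built from the DataFrame's own schema
  df.map(f)
}

// e.g. a pass-through; substitute any transformation that preserves the schema
mapAllColumns(df)(identity).first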

Hope it helps :-)
