
How would I do this DataFrame transformation in Spark Scala?

Say I have this original dataframe:

  var df1 = Seq(("John","Jameson","TRUE","TRUE","FALSE"),("Kevin","Smith","TRUE","FALSE","TRUE"))
    .toDF("First Name","Last Name","Married","Employed","Children")


and I want to convert it so that it fits into this template:

[image]

The output dataframe will look like this:

[image]

I want to iterate over the columns "Married", "Employed", and "Children" using "when" conditions, and then populate the template as in the screenshot above.

Any help would truly be appreciated!

Have a great day.

You could pair each selected column's value with its name in a struct, collect the structs into an array, and flatten the array with explode, as shown below:

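import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(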
  ("John", "Jameson", "TRUE", "TRUE", "FALSE"),
  ("Kevin", "Smith", "TRUE", "FALSE", "TRUE")
).toDF("First Name", "Last Name", "Married", "Employed", "Children")

val cols = df.columns.filterNot(_.endsWith("Name"))
// cols: Array[String] = Array(Married, Employed, Children)

df.
  withColumn("Temp", explode(array(cols.map(
    c => struct(col(c).as("Value"), lit(c).as("Criteria"))): _*))
  ).
  select($"First Name", $"Last Name", $"Temp.*").
  show
// +----------+---------+-----+--------+
// |First Name|Last Name|Value|Criteria|
// +----------+---------+-----+--------+
// |      John|  Jameson| TRUE| Married|
// |      John|  Jameson| TRUE|Employed|
// |      John|  Jameson|FALSE|Children|
// |     Kevin|    Smith| TRUE| Married|
// |     Kevin|    Smith|FALSE|Employed|
// |     Kevin|    Smith| TRUE|Children|
// +----------+---------+-----+--------+
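
To see what the struct/array/explode pipeline does to each row, the same pair-and-flatten step can be mimicked on plain Scala collections, with no Spark session needed. This is only an illustrative sketch: the `Person` and `Flat` case classes and the `unpivot` helper are hypothetical names, not part of Spark.

```scala
// Plain-Scala analogue of struct + array + explode (no Spark required).
// Person, Flat and unpivot are hypothetical names used for illustration only.
object UnpivotSketch {
  final case class Person(first: String, last: String, flags: Map[String, String])
  final case class Flat(first: String, last: String, value: String, criteria: String)

  // For each person, pair every selected column name with its value
  // (the "struct"), collect the pairs (the "array"), and emit one output
  // row per pair (the "explode"); flatMap does the flattening here.
  def unpivot(people: Seq[Person], cols: Seq[String]): Seq[Flat] =
    people.flatMap { p =>
      cols.map(c => Flat(p.first, p.last, p.flags(c), c))
    }

  val cols = Seq("Married", "Employed", "Children")
  val people = Seq(
    Person("John", "Jameson", Map("Married" -> "TRUE", "Employed" -> "TRUE", "Children" -> "FALSE")),
    Person("Kevin", "Smith", Map("Married" -> "TRUE", "Employed" -> "FALSE", "Children" -> "TRUE"))
  )

  def main(args: Array[String]): Unit =
    unpivot(people, cols).foreach(println)
}
```

Each input row yields one output row per selected column, in column order, which matches the row order in the Spark output above.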

Another solution uses the stack() function:

val df = Seq(
  ("John", "Jameson", "TRUE", "TRUE", "FALSE"),
  ("Kevin", "Smith", "TRUE", "FALSE", "TRUE")
).toDF("First Name", "Last Name", "Married", "Employed", "Children")
df.show(false)
df.createOrReplaceTempView("df")

+----------+---------+-------+--------+--------+
|First Name|Last Name|Married|Employed|Children|
+----------+---------+-------+--------+--------+
|John      |Jameson  |TRUE   |TRUE    |FALSE   |
|Kevin     |Smith    |TRUE   |FALSE   |TRUE    |
+----------+---------+-------+--------+--------+

spark.sql("""
select `First Name`, `Last Name`, stack(3,Married,"Married",Employed,"Employed",Children,"Children") (Value,Criteria) from df
""").show(false)

+----------+---------+-----+--------+
|First Name|Last Name|Value|Criteria|
+----------+---------+-----+--------+
|John      |Jameson  |TRUE |Married |
|John      |Jameson  |TRUE |Employed|
|John      |Jameson  |FALSE|Children|
|Kevin     |Smith    |TRUE |Married |
|Kevin     |Smith    |FALSE|Employed|
|Kevin     |Smith    |TRUE |Children|
+----------+---------+-----+--------+

If you prefer the DataFrame API over a SQL query:

df.selectExpr( "`First Name`", "`Last Name`",  """ stack(3,Married,"Married",Employed,"Employed",Children,"Children") (value,criteria) """ ).show(false)

+----------+---------+-----+--------+
|First Name|Last Name|value|criteria|
+----------+---------+-----+--------+
|John      |Jameson  |TRUE |Married |
|John      |Jameson  |TRUE |Employed|
|John      |Jameson  |FALSE|Children|
|Kevin     |Smith    |TRUE |Married |
|Kevin     |Smith    |FALSE|Employed|
|Kevin     |Smith    |TRUE |Children|
+----------+---------+-----+--------+

Or:

df.select( $"First Name", $"Last Name", expr(""" stack(3,Married,"Married",Employed,"Employed",Children,"Children") (value,criteria) """) ).show(false)
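
The stack() calls above hard-code the column count and the column names. As a hedged sketch, the expression string could instead be generated from a list of column names. `StackExpr.build` is a hypothetical helper, not a Spark API, and it assumes the column names contain no backticks or single quotes.

```scala
// Hypothetical helper that builds a stack() expression string from column names.
// Assumption: column names contain no backticks or single quotes.
object StackExpr {
  def build(
      cols: Seq[String],
      valueName: String = "Value",
      criteriaName: String = "Criteria"
  ): String = {
    require(cols.nonEmpty, "at least one column is needed")
    // Each column contributes a (value, name) pair: the backtick-quoted column
    // reference followed by the column name as a string literal.
    val pairs = cols.map(c => s"`$c`, '$c'").mkString(", ")
    s"stack(${cols.size}, $pairs) as ($valueName, $criteriaName)"
  }
}
```

With the DataFrame above, this might be used as `df.selectExpr("`First Name`", "`Last Name`", StackExpr.build(df.columns.filterNot(_.endsWith("Name"))))`, so the expression tracks the columns when they are added or removed.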
