How would I do this DataFrame transformation in Spark Scala?
Suppose I have this original DataFrame:
val df1 = Seq(("John","Jameson","TRUE","TRUE","FALSE"),("Kevin","Smith","TRUE","FALSE","TRUE"))
  .toDF("First Name","Last Name","Married","Employed","Children")
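I want to transform it to fit this template:
The output DataFrame would look like this:
I would like to use a "when" condition to iterate over the "Married", "Employed", and "Children" columns and then fill in the template as in the screenshot above.
Any help would be greatly appreciated!
Have a nice day.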
You can pair each selected column's value and name into a struct, group the structs into an array, and flatten the array with explode, as shown below:
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(
  ("John", "Jameson", "TRUE", "TRUE", "FALSE"),
  ("Kevin", "Smith", "TRUE", "FALSE", "TRUE")
).toDF("First Name", "Last Name", "Married", "Employed", "Children")

// Every column whose name does not end in "Name" is unpivoted.
val cols = df.columns.filterNot(_.endsWith("Name"))
// cols: Array[String] = Array(Married, Employed, Children)

df.
  withColumn("Temp", explode(array(cols.map(
    c => struct(col(c).as("Value"), lit(c).as("Criteria"))): _*))
  ).
  select($"First Name", $"Last Name", $"Temp.*").
  show
// +----------+---------+-----+--------+
// |First Name|Last Name|Value|Criteria|
// +----------+---------+-----+--------+
// |      John|  Jameson| TRUE| Married|
// |      John|  Jameson| TRUE|Employed|
// |      John|  Jameson|FALSE|Children|
// |     Kevin|    Smith| TRUE| Married|
// |     Kevin|    Smith|FALSE|Employed|
// |     Kevin|    Smith| TRUE|Children|
// +----------+---------+-----+--------+
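If you need the same reshaping for other DataFrames, the struct/array/explode steps above can be wrapped in a small helper. This is only a sketch: the name `unpivotColumns` and its parameters are hypothetical (not part of Spark or of the answer above), and the local `SparkSession` is set up just so the snippet runs on its own. It also assumes all value columns share one type, as they do here (strings).

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// Local session only so the sketch is self-contained; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("unpivot-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: turns each column in `valueCols` into its own row,
// keeping `idCols` and adding "Value" / "Criteria" columns.
def unpivotColumns(df: DataFrame, idCols: Seq[String], valueCols: Seq[String]): DataFrame = {
  val ids: Seq[Column] = idCols.map(c => col(c))
  // One struct per value column: (cell value, column name).
  val pairs: Seq[Column] =
    valueCols.map(c => struct(col(c).as("Value"), lit(c).as("Criteria")))
  df.select(ids :+ explode(array(pairs: _*)).as("Temp"): _*)
    .select(ids :+ col("Temp.*"): _*)
}

val people = Seq(
  ("John", "Jameson", "TRUE", "TRUE", "FALSE"),
  ("Kevin", "Smith", "TRUE", "FALSE", "TRUE")
).toDF("First Name", "Last Name", "Married", "Employed", "Children")

val unpivoted = unpivotColumns(
  people,
  idCols = Seq("First Name", "Last Name"),
  valueCols = Seq("Married", "Employed", "Children")
)
unpivoted.show()
```

Passing the column lists explicitly avoids relying on the `endsWith("Name")` naming convention, which may not hold for other schemas.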
Another solution, using the stack() function:
val df = Seq(
("John", "Jameson", "TRUE", "TRUE", "FALSE"),
("Kevin", "Smith", "TRUE", "FALSE", "TRUE")
).toDF("First Name", "Last Name", "Married", "Employed", "Children")
df.show(false)
df.createOrReplaceTempView("df")
+----------+---------+-------+--------+--------+
|First Name|Last Name|Married|Employed|Children|
+----------+---------+-------+--------+--------+
|John      |Jameson  |TRUE   |TRUE    |FALSE   |
|Kevin     |Smith    |TRUE   |FALSE   |TRUE    |
+----------+---------+-------+--------+--------+
spark.sql("""
select `First Name`, `Last Name`, stack(3,Married,"Married",Employed,"Employed",Children,"Children") (Value,Criteria) from df
""").show(false)
+----------+---------+-----+--------+
|First Name|Last Name|Value|Criteria|
+----------+---------+-----+--------+
|John      |Jameson  |TRUE |Married |
|John      |Jameson  |TRUE |Employed|
|John      |Jameson  |FALSE|Children|
|Kevin     |Smith    |TRUE |Married |
|Kevin     |Smith    |FALSE|Employed|
|Kevin     |Smith    |TRUE |Children|
+----------+---------+-----+--------+
If you prefer to use DataFrame operations:
df.selectExpr( "`First Name`", "`Last Name`", """ stack(3,Married,"Married",Employed,"Employed",Children,"Children") (value,criteria) """ ).show(false)
+----------+---------+-----+--------+
|First Name|Last Name|value|criteria|
+----------+---------+-----+--------+
|John      |Jameson  |TRUE |Married |
|John      |Jameson  |TRUE |Employed|
|John      |Jameson  |FALSE|Children|
|Kevin     |Smith    |TRUE |Married |
|Kevin     |Smith    |FALSE|Employed|
|Kevin     |Smith    |TRUE |Children|
+----------+---------+-----+--------+
Or:
df.select( $"First Name", $"Last Name", expr(""" stack(3,Married,"Married",Employed,"Employed",Children,"Children") (value,criteria) """) ).show(false)
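The stack() calls above hard-code the three column names. If the list of columns may change, the expression string can be generated from the schema instead. The following is a minimal sketch under a few assumptions: `idCols`, `valueCols`, and `stackExpr` are illustrative names, every non-id column is assumed to be a string (stack() needs matching types within each output column), and the local `SparkSession` is there only to make the snippet self-contained.

```scala
import org.apache.spark.sql.SparkSession

// Local session only so the sketch is self-contained; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("stack-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("John", "Jameson", "TRUE", "TRUE", "FALSE"),
  ("Kevin", "Smith", "TRUE", "FALSE", "TRUE")
).toDF("First Name", "Last Name", "Married", "Employed", "Children")

// Columns to keep as-is; everything else is unpivoted.
val idCols = Seq("First Name", "Last Name")
val valueCols = df.columns.filterNot(c => idCols.contains(c))

// Builds: stack(3, `Married`, 'Married', `Employed`, 'Employed', `Children`, 'Children') as (Value, Criteria)
val stackExpr = valueCols
  .map(c => s"`$c`, '$c'")
  .mkString(s"stack(${valueCols.length}, ", ", ", ") as (Value, Criteria)")

// Backticks quote the id column names, which contain spaces.
val result = df.selectExpr(idCols.map(c => s"`$c`") :+ stackExpr: _*)
result.show(false)
```

Column names that themselves contain a backtick or a single quote would need escaping before being spliced into the expression; this sketch does not handle that case.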