Spark: create column name based on other column values
I'm new to Spark and need help transforming this data into the format below.
I have data in this format:
+----------+-------------------------+-------------------+---------+------+
| id       | values                  | creation date     | leadTime| span |
+----------+-------------------------+-------------------+---------+------+
| id_1     | [[v1, 0.368], [v2, 0.5]]| 2020-07-15        | 16      | 15   |
| id_2     | [[v1, 0.368], [v2, 0.4]]| 2020-07-15        | 16      | 15   |
| id_3     | [[v1, 0.468], [v2, 0.3]]| 2020-07-15        | 16      | 15   |
| id_4     | [[v1, 0.368], [v2, 0.3]]| 2020-07-15        | 16      | 15   |
| id_5     | [[v1, 0.668], [v2, 0.1]]| 2020-07-15        | 16      | 15   |
| id_6     | [[v1, 0.168], [v2, 0.2]]| 2020-07-15        | 16      | 15   |
+----------+-------------------------+-------------------+---------+------+
Using the values from those columns, I need the data in the following format,
where the new column names are built from the leadTime and span column values:
+----------+--------------+--------------------+--------------------+
| id       |creation date | final_v1_16_15_wk  | final_v2_16_15_wk  |
+----------+--------------+--------------------+--------------------+
| id_1     | 2020-07-15   | 0.368              | 0.5                |
| id_2     | 2020-07-15   | 0.368              | 0.4                |
| id_3     | 2020-07-15   | 0.468              | 0.3                |
| id_4     | 2020-07-15   | 0.368              | 0.3                |
| id_5     | 2020-07-15   | 0.668              | 0.1                |
| id_6     | 2020-07-15   | 0.168              | 0.2                |
+----------+--------------+--------------------+--------------------+
Another example of such a DF:
val df = Seq(("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),("id_1", Map("v1" -> 0.564, "v2" -> 0.78), "2020-07-15", 17, 18),("id_2", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15),("id_2", Map("v1" -> 0.657, "v2" -> 0.65), "2020-07-15", 17, 18)).toDF("id", "values", "creation date", "leadTime", "span")
The expected output format is as follows:
+------+-------------+-----------------+-----------------+-----------------+-----------------+
| id   |creation date|final_v1_16_15_wk|final_v1_17_18_wk|final_v2_16_15_wk|final_v2_17_18_wk|
+------+-------------+-----------------+-----------------+-----------------+-----------------+
| id_1 | 2020-07-15  | 0.368           | 0.564           | 0.5             | 0.78            |
| id_2 | 2020-07-15  | 0.468           | 0.657           | 0.3             | 0.65            |
+------+-------------+-----------------+-----------------+-----------------+-----------------+
I tried to generate the column names/values with the following logic, but it does not work:
val modDF = finalDF.withColumn("final_" + finalDF("values").getItem(0).getItem("_1") + "_" + finalDF("leadTime") + "_" + finalDF("span") + "_wk", $"values".getItem(0).getItem("_2"))
(This fails because the first argument of withColumn must be a plain String; concatenating Column objects into it merely stringifies the expressions instead of substituting per-row values. Column names are fixed in the schema, so they have to be computed on the driver first.)
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),
("id_2", Map("v1" -> 0.368, "v2" -> 0.4), "2020-07-15", 16, 15),
("id_3", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15),
("id_4", Map("v1" -> 0.368, "v2" -> 0.3), "2020-07-15", 16, 15),
("id_5", Map("v1" -> 0.668, "v2" -> 0.1), "2020-07-15", 16, 15),
("id_6", Map("v1" -> 0.168, "v2" -> 0.2), "2020-07-15", 16, 15)
).toDF("id", "values", "creation date", "leadTime", "span")
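// read leadTime/span once from the first row; this assumes every row
// carries the same (leadTime, span) pair, as in the sample data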
val c1 = df.select('leadTime).first.getInt(0)
val c2 = df.select('span).first.getInt(0)
val df1 = df
.select('id,
col("creation date"),
col("values")("v1").as("v1"),
col("values")("v2").as("v2"))
.withColumnRenamed("v1", s"final_v1_${c1}_${c2}_wk")
.withColumnRenamed("v2", s"final_v2_${c1}_${c2}_wk")
df1.show(false)
// +----+-------------+-----------------+-----------------+
// |id |creation date|final_v1_16_15_wk|final_v2_16_15_wk|
// +----+-------------+-----------------+-----------------+
// |id_1|2020-07-15 |0.368 |0.5 |
// |id_2|2020-07-15 |0.368 |0.4 |
// |id_3|2020-07-15 |0.468 |0.3 |
// |id_4|2020-07-15 |0.368 |0.3 |
// |id_5|2020-07-15 |0.668 |0.1 |
// |id_6|2020-07-15 |0.168 |0.2 |
// +----+-------------+-----------------+-----------------+
// other variant
val df3 = df
.withColumn(s"final_v1_${c1}_${c2}_wk", col("values")("v1"))
.withColumn(s"final_v2_${c1}_${c2}_wk", col("values")("v2"))
.select('id,
col("creation date"),
col(s"final_v1_${c1}_${c2}_wk"),
col(s"final_v2_${c1}_${c2}_wk")
)
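Both variants hard-code the v1 and v2 map keys. When the keys are not known in advance, they can be collected from the data first. A minimal sketch under the same single-leadTime/span assumption (keys and dynDF are illustrative names, not from the original post):
// collect the distinct map keys to the driver
val keys = df.select(explode(map_keys(col("values")))).distinct.as[String].collect.sorted
// add one renamed column per key, then project
val dynDF = keys.foldLeft(df) { (tmp, k) =>
  tmp.withColumn(s"final_${k}_${c1}_${c2}_wk", col("values")(k))
}.select("id", ("creation date" +: keys.map(k => s"final_${k}_${c1}_${c2}_wk")): _*)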
In response to a comment:
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK
import spark.implicits._
val df30 = Seq(
("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),
("id_1", Map("v1" -> 0.564, "v2" -> 0.78), "2020-07-15", 17, 18),
("id_2", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 16, 15),
("id_2", Map("v1" -> 0.657, "v2" -> 0.65), "2020-07-15", 17, 18))
.toDF("id", "values", "creation date", "leadTime", "span")
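// note: the column names below are derived from the first group's collected
// arrays and assumed to apply in the same order to every group; collect_list
// gives no ordering guarantee across groups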
val df31 = df30.groupBy("id", "creation date")
.agg(
collect_list(col("values")).alias("values"),
collect_list(col("leadTime")).alias("leadTime"),
collect_list(col("span")).alias("span")
).persist(MEMORY_AND_DISK)
val leadTimeArray = df31.select('leadTime).first.getAs[WrappedArray[Int]](0).toArray
val spanArray = df31.select('span).first.getAs[WrappedArray[Int]](0).toArray
val valuesArrayNew = df31.select('values).first.getAs[WrappedArray[Map[String, Double]]](0).toList
val newCols = valuesArrayNew
.zipWithIndex
.flatMap{case(v, i) => v.keys.map(k => s"final_${k}_${leadTimeArray(i)}_${spanArray(i)}_wk")}
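// each collected map contributes two names (v1 and v2), so indexOf/2 maps a
// generated name back to its slot in the values array; the key is read off the name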
val resDF = newCols.foldLeft(df31){(tempDF, colName) =>
tempDF.withColumn(colName,
col("values")(newCols.indexOf(colName) / 2)(if (colName.contains("v1")) "v1" else "v2"))
}.drop("values", "leadTime", "span")
resDF.show(false)
// +----+-------------+-----------------+-----------------+-----------------+-----------------+
// |id |creation date|final_v1_16_15_wk|final_v2_16_15_wk|final_v1_17_18_wk|final_v2_17_18_wk|
// +----+-------------+-----------------+-----------------+-----------------+-----------------+
// |id_1|2020-07-15 |0.368 |0.5 |0.564 |0.78 |
// |id_2|2020-07-15 |0.468 |0.3 |0.657 |0.65 |
// +----+-------------+-----------------+-----------------+-----------------+-----------------+
df31.unpersist()
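A sturdier alternative to indexing into collected arrays is to explode the map and pivot on a computed column name, which avoids relying on collect_list ordering entirely. A sketch under the same schema assumptions (withKey and resDF2 are illustrative names):
// one row per (id, key, value) with the target column name precomputed;
// exploding a map column yields "key" and "value" columns
val withKey = df30
  .select(col("id"), col("creation date"), col("leadTime"), col("span"),
    explode(col("values")))
  .withColumn("colName",
    format_string("final_%s_%d_%d_wk", col("key"), col("leadTime"), col("span")))
// pivot the computed names into columns; first() picks the single value per cell
val resDF2 = withKey
  .groupBy("id", "creation date")
  .pivot("colName")
  .agg(first("value"))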