Spark Scala: modify DataFrame columns based on another DataFrame
I am new to Spark and Scala, and I would like to know how to perform an operation between two DataFrames. In my case, I have these two DataFrames:
DF1:
ID_EMPLOYEE  sup_id_1  desc_1  sup_id_2  desc_2  ...  sup_id_18  desc_18  sup_id_19  desc_19
AAAAAAAA     SUP_ID1           SUP_ID2           ...  SUP_ID3              SUP_ID4
BBBBBBBBB    SUP_ID4           SUP_ID6           ...  SUP_ID6              SUP_ID6
CCCCCCCCC    SUP_ID5           SUP_ID5           ...  SUP_ID5              SUP_ID5
DDDDDDDD     SUP_ID7           SUP_ID7           ...  SUP_ID7              SUP_ID7
and
DF2:
Key      Desc
SUP_ID1  Desc1
SUP_ID2  Desc2
SUP_ID3  Desc3
SUP_ID4  Desc4
SUP_ID5  Desc5
SUP_ID6  Desc6
SUP_ID7  Desc7
I want to fill the desc_* columns of DF1 using DF2, since in DF1 they are empty. The way to fill them is to match each of DF1's sup_id_* columns against DF2's Key column, and write the corresponding value of DF2's Desc column into DF1's desc_* column.

I am not sure what the simplest way to do this is. The only approach I can think of is to treat the DataFrames as SQL tables and perform as many joins as there are desc_* columns, but that does not seem like the most efficient way.
import spark.implicits._
import org.apache.spark.sql.functions.col

case class Source1(
  idEmploye: String,
  sup_id_1: String, desc_1: Option[String],
  sup_id_2: String, desc_2: Option[String],
  sup_id_3: String, desc_3: Option[String],
  sup_id_4: String, desc_4: Option[String],
  sup_id_5: String, desc_5: Option[String],
  sup_id_6: String, desc_6: Option[String]
)

val source1 = Seq(
  Source1("AAAAAAAA", "SUP_ID1", None, "SUP_ID2", None, "SUP_ID3", None, "SUP_ID4", None, "SUP_ID5", None, "SUP_ID8", None),
  Source1("BBBBBBBBB", "SUP_ID4", None, "SUP_ID6", None, "SUP_ID6", None, "SUP_ID6", None, "SUP_ID6", None, "SUP_ID8", None),
  Source1("CCCCCCCCC", "SUP_ID5", None, "SUP_ID5", None, "SUP_ID5", None, "SUP_ID5", None, "SUP_ID5", None, "SUP_ID8", None),
  Source1("DDDDDDDD", "SUP_ID7", None, "SUP_ID7", None, "SUP_ID7", None, "SUP_ID7", None, "SUP_ID7", None, "SUP_ID8", None)
).toDF()

val source2 = Seq(
  ("SUP_ID1", "Desc1"),
  ("SUP_ID2", "Desc2"),
  ("SUP_ID3", "Desc3"),
  ("SUP_ID4", "Desc4"),
  ("SUP_ID5", "Desc5"),
  ("SUP_ID6", "Desc6"),
  ("SUP_ID7", "Desc7")
).toDF("Key", "Desc")

// One (sup_id_N, desc_N) pair per index; with 13 columns total this is 1 to 6.
val listColumns = 1 to ((source1.columns.length - 1) / 2)

// Left-join once per index: drop the empty desc_N column and replace it
// with the Desc value matched on sup_id_N === Key.
val source12 = listColumns.foldLeft(source1) { (memoDF, idx) =>
  memoDF
    .join(source2, memoDF.col(s"sup_id_$idx") === source2.col("Key"), "left_outer")
    .drop("Key", s"desc_$idx")
    .withColumnRenamed("Desc", s"desc_$idx")
}

// Restore the original column order.
val resDF = source12.select(source1.columns.map(col): _*)
resDF.printSchema
// root
// |-- idEmploye: string (nullable = true)
// |-- sup_id_1: string (nullable = true)
// |-- desc_1: string (nullable = true)
// |-- sup_id_2: string (nullable = true)
// |-- desc_2: string (nullable = true)
// |-- sup_id_3: string (nullable = true)
// |-- desc_3: string (nullable = true)
// |-- sup_id_4: string (nullable = true)
// |-- desc_4: string (nullable = true)
// |-- sup_id_5: string (nullable = true)
// |-- desc_5: string (nullable = true)
// |-- sup_id_6: string (nullable = true)
// |-- desc_6: string (nullable = true)
resDF.show(false)
// +---------+--------+------+--------+------+--------+------+--------+------+--------+------+--------+------+
// |idEmploye|sup_id_1|desc_1|sup_id_2|desc_2|sup_id_3|desc_3|sup_id_4|desc_4|sup_id_5|desc_5|sup_id_6|desc_6|
// +---------+--------+------+--------+------+--------+------+--------+------+--------+------+--------+------+
// |AAAAAAAA |SUP_ID1 |Desc1 |SUP_ID2 |Desc2 |SUP_ID3 |Desc3 |SUP_ID4 |Desc4 |SUP_ID5 |Desc5 |SUP_ID8 |null |
// |BBBBBBBBB|SUP_ID4 |Desc4 |SUP_ID6 |Desc6 |SUP_ID6 |Desc6 |SUP_ID6 |Desc6 |SUP_ID6 |Desc6 |SUP_ID8 |null |
// |CCCCCCCCC|SUP_ID5 |Desc5 |SUP_ID5 |Desc5 |SUP_ID5 |Desc5 |SUP_ID5 |Desc5 |SUP_ID5 |Desc5 |SUP_ID8 |null |
// |DDDDDDDD |SUP_ID7 |Desc7 |SUP_ID7 |Desc7 |SUP_ID7 |Desc7 |SUP_ID7 |Desc7 |SUP_ID7 |Desc7 |SUP_ID8 |null |
// +---------+--------+------+--------+------+--------+------+--------+------+--------+------+--------+------+
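Since DF2 is a small lookup table, an alternative that avoids one join per desc_* column is to collect it into a driver-side Map and index a map-typed literal column with each sup_id_* column. This is a sketch, not the answer's method; it assumes the lookup table fits comfortably in driver memory, and reuses source1, source2, and listColumns from above:

```scala
import org.apache.spark.sql.functions.{col, typedLit}

// Collect the small lookup DataFrame into a Map on the driver.
// Assumes source2 has exactly the two columns (Key, Desc).
val lookup: Map[String, String] = source2.as[(String, String)].collect().toMap

// A map literal column; indexing it with a key column yields the mapped
// value, or null when the key is absent (e.g. SUP_ID8).
val lookupCol = typedLit(lookup)

// Replace each desc_* column in a narrow, join-free transformation.
val resDF2 = listColumns.foldLeft(source1) { (df, idx) =>
  df.withColumn(s"desc_$idx", lookupCol(col(s"sup_id_$idx")))
}
```

This keeps the column order of source1 unchanged, so no final select is needed; the trade-off is that it only works while the lookup table is small enough to ship to the driver and embed in the plan.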
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.