Select a column based on another column's value in Spark Dataframe using Scala
I have a dataframe with 5 columns: `sourceId`, `score_1`, `score_3`, `score_4` and `score_7`. The values of the `sourceId` column can be [1, 3, 4, 7]. I want to convert this into another dataframe that has the columns `sourceId` and `score`, where `score` depends on the value of the `sourceId` column.

sourceId | score_1 | score_3 | score_4 | score_7
---|---|---|---|---
1 | 0.3 | 0.7 | 0.45 | 0.21
4 | 0.15 | 0.66 | 0.73 | 0.47
7 | 0.34 | 0.41 | 0.78 | 0.16
3 | 0.77 | 0.1 | 0.93 | 0.67

So if `sourceId = 1`, we select the value of `score_1` for that record; if `sourceId = 3`, we select the value of `score_3`, and so on.
The result would be:

sourceId | score
---|---
1 | 0.3
4 | 0.73
7 | 0.16
3 | 0.1

What would be the best way to do this in Spark?
You can chain multiple `when` expressions, one per possible `sourceId` value:
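
    import org.apache.spark.sql.functions.{col, lit, when}

    val ids = Seq(1, 3, 4, 7)
    // Fold the ids into one nested when/otherwise expression; a sourceId outside ids yields null.
    val scoreCol = ids.foldLeft(lit(null)) { case (acc, id) =>
      when(col("sourceId") === id, col(s"score_$id")).otherwise(acc)
    }
    val df2 = df.withColumn("score", scoreCol)
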
Or build a map expression from the `score_*` columns and use it to look up the `score` values:
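
    import org.apache.spark.sql.functions.{col, lit, map}

    // Map keys are the id suffixes as strings ("1", "3", ...); values are the score columns.
    val scoreMap = map(
      df.columns
        .filter(_.startsWith("score_"))
        .flatMap(c => Seq(lit(c.split("_")(1)), col(c))): _*
    )
    // Cast sourceId to string so the lookup key has the same type as the map keys.
    val df2 = df.withColumn("score", scoreMap(col("sourceId").cast("string")))
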
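To check both approaches against the sample data from the question, a minimal end-to-end sketch could look like the following. It assumes a local `SparkSession` (in `spark-shell`, `getOrCreate()` simply returns the existing session). The names `byWhen` and `byMap` are illustrative and not part of the original answer, and the final `select` is added here only to keep the two columns the question asks for.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, map, when}

// Assumption: a local session is enough for this toy data set.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("select-score-by-sourceId")
  .getOrCreate()
import spark.implicits._

// Sample rows copied from the table in the question.
val df = Seq(
  (1, 0.30, 0.70, 0.45, 0.21),
  (4, 0.15, 0.66, 0.73, 0.47),
  (7, 0.34, 0.41, 0.78, 0.16),
  (3, 0.77, 0.10, 0.93, 0.67)
).toDF("sourceId", "score_1", "score_3", "score_4", "score_7")

// Approach 1: nested when/otherwise built with foldLeft.
// The null seed is cast to double so the result column has an explicit type.
val ids = Seq(1, 3, 4, 7)
val scoreCol = ids.foldLeft(lit(null).cast("double")) { (acc, id) =>
  when(col("sourceId") === id, col(s"score_$id")).otherwise(acc)
}
val byWhen = df.withColumn("score", scoreCol).select("sourceId", "score")

// Approach 2: a map from id suffix (as string) to score column, looked up by sourceId.
val scoreMap = map(
  df.columns
    .filter(_.startsWith("score_"))
    .flatMap(c => Seq(lit(c.stripPrefix("score_")), col(c))): _*
)
val byMap = df
  .withColumn("score", scoreMap(col("sourceId").cast("string")))
  .select("sourceId", "score")

// Both should pair each sourceId with the score listed in the question's result table.
byWhen.show()
byMap.show()
```

The `when` chain needs the list of ids up front, while the map version discovers the ids from the column names, so it may be the more convenient choice when the set of `score_*` columns changes over time.
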
Another way of doing it is to build the `when` condition dynamically (shown here in Java):
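
    import static org.apache.spark.sql.functions.*;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.Column;

    // Placeholder names from the original answer; list the real score columns here.
    List<String> scoresCols = Arrays.asList("score_1", "score_2" /* , ... */);
    // sourceId holds only the numeric part, so prefix it with "score_" before comparing.
    Column sourceKey = concat(lit("score_"), col("sourceId").cast("string"));
    Column actualScoreCol = when(sourceKey.equalTo(scoresCols.get(0)),
            col(scoresCols.get(0)).cast("string"));
    for (int i = 1; i < scoresCols.size(); i++) {
        actualScoreCol = actualScoreCol.when(sourceKey.equalTo(scoresCols.get(i)),
                col(scoresCols.get(i)).cast("string"));
    }
    ds = joinedDataset.withColumn("actual", actualScoreCol);
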