I have a dataframe with 5 columns: sourceId, score_1, score_3, score_4 and score_7. The values of the sourceId column can be [1, 3, 4, 7]. I want to convert this into another dataframe that has the columns sourceId and score, where score depends on the value of the sourceId column.
sourceId | score_1 | score_3 | score_4 | score_7 |
---|---|---|---|---|
1 | 0.3 | 0.7 | 0.45 | 0.21 |
4 | 0.15 | 0.66 | 0.73 | 0.47 |
7 | 0.34 | 0.41 | 0.78 | 0.16 |
3 | 0.77 | 0.1 | 0.93 | 0.67 |
So if sourceId = 1, we select the value of score_1 for that record; if sourceId = 3, we select the value of score_3, and so on.
The result would be:
sourceId | score |
---|---|
1 | 0.3 |
4 | 0.73 |
7 | 0.16 |
3 | 0.1 |
What would be the best way to do this in Spark?
One option is to chain multiple when expressions over the sourceId column values:
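import org.apache.spark.sql.functions.{col, lit, when}

val ids = Seq(1, 3, 4, 7)
// Start from a null default and wrap it in one when(...).otherwise(...) per id.
val scoreCol = ids.foldLeft(lit(null)) { case (acc, id) =>
  when(col("sourceId") === id, col(s"score_$id")).otherwise(acc)
}
val df2 = df.withColumn("score", scoreCol)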
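To try the when chain end to end, here is a minimal sketch that rebuilds the sample data from the question in a local session. The object name `WhenChainExample`, the helper `withScore`, and the `local[*]` session settings are assumptions made for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object WhenChainExample {
  // Hypothetical helper: adds a "score" column picked from score_<sourceId>.
  def withScore(df: DataFrame, ids: Seq[Int]): DataFrame = {
    // Each id wraps the accumulated expression in another when/otherwise layer.
    val scoreCol = ids.foldLeft(lit(null)) { case (acc, id) =>
      when(col("sourceId") === id, col(s"score_$id")).otherwise(acc)
    }
    df.withColumn("score", scoreCol)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("when-chain-example")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      (1, 0.3, 0.7, 0.45, 0.21),
      (4, 0.15, 0.66, 0.73, 0.47),
      (7, 0.34, 0.41, 0.78, 0.16),
      (3, 0.77, 0.1, 0.93, 0.67)
    ).toDF("sourceId", "score_1", "score_3", "score_4", "score_7")

    withScore(df, Seq(1, 3, 4, 7)).select("sourceId", "score").show()
    spark.stop()
  }
}
```

With the question's data, the score column should hold 0.3, 0.73, 0.16 and 0.1, which matches the expected result table. A sourceId that is not in the list falls through every branch and gets null.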
Or build a map expression from the score_* columns and use it to look up the score values:
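import org.apache.spark.sql.functions.{col, lit, map}

// Keys are the id suffixes as strings ("1", "3", ...); values are the score columns.
val scoreMap = map(
  df.columns
    .filter(_.startsWith("score_"))
    .flatMap(c => Seq(lit(c.split("_")(1)), col(c))): _*
)
// Cast sourceId to string so the lookup key type matches the map's key type.
val df2 = df.withColumn("score", scoreMap(col("sourceId").cast("string")))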
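If sourceId is an integer column, a variant of the map approach is to parse the column suffix into an Int so that the map keys and the lookup key share a type. The sketch below assumes every score_* suffix is a valid integer; the object name `ScoreMapExample` and the helper `withScore` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, map}

object ScoreMapExample {
  // Hypothetical helper: builds map(1 -> score_1, 3 -> score_3, ...) and looks up sourceId.
  def withScore(df: DataFrame): DataFrame = {
    val entries = df.columns
      .filter(_.startsWith("score_"))
      // Assumes the suffix after "score_" always parses as an Int.
      .flatMap(c => Seq(lit(c.stripPrefix("score_").toInt), col(c)))
    val scoreMap = map(entries: _*)
    df.withColumn("score", scoreMap(col("sourceId")))
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("score-map-example")
      .getOrCreate()
    import spark.implicits._

    // Two of the sample rows from the question.
    val df = Seq(
      (1, 0.3, 0.7, 0.45, 0.21),
      (4, 0.15, 0.66, 0.73, 0.47)
    ).toDF("sourceId", "score_1", "score_3", "score_4", "score_7")

    withScore(df).select("sourceId", "score").show()
    spark.stop()
  }
}
```

Because the keys are derived from the column names, this version picks up any additional score_* column without a hard-coded id list.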
Another way of doing it is to build the when condition dynamically:
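import static org.apache.spark.sql.functions.*;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Column;

List<String> scoresCols = Arrays.asList("score_1", "score_3", "score_4", "score_7");
// Prefix sourceId with "score_" so it can be compared with the column names.
Column sourceName = concat(lit("score_"), col("sourceId").cast("string"));
// The cast to string on the score is optional; drop it to keep the numeric type.
Column actualScoreCol = when(sourceName.equalTo(scoresCols.get(0)),
        col(scoresCols.get(0)).cast("string"));
for (int i = 1; i < scoresCols.size(); i++) {
    actualScoreCol = actualScoreCol.when(sourceName.equalTo(scoresCols.get(i)),
            col(scoresCols.get(i)).cast("string"));
}
// joinedDataset is the input Dataset<Row>; rows matching no branch get null.
ds = joinedDataset.withColumn("actual", actualScoreCol);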