
Select a column based on another column's value in Spark Dataframe using Scala

I have a dataframe with 5 columns: sourceId, score_1, score_3, score_4 and score_7. The sourceId column takes values from [1, 3, 4, 7]. I want to convert this into another dataframe with columns sourceId and score, where score is taken from the score_* column that matches the value of sourceId.

sourceId  score_1  score_3  score_4  score_7
1         0.3      0.7      0.45     0.21
4         0.15     0.66     0.73     0.47
7         0.34     0.41     0.78     0.16
3         0.77     0.1      0.93     0.67

So if sourceId = 1, we select the value of score_1 for that record; if sourceId = 3, we select the value of score_3, and so on.

Result would be

sourceId  score
1         0.3
4         0.73
7         0.16
3         0.1

What would be the best way to do this in Spark?

Chain multiple when expressions over the possible sourceId values:

import org.apache.spark.sql.functions.{col, lit, when}

val ids = Seq(1, 3, 4, 7)

// Fold over the ids, nesting each when(...) inside the previous otherwise(...)
val scoreCol = ids.foldLeft(lit(null)) { case (acc, id) =>
  when(col("sourceId") === id, col(s"score_$id")).otherwise(acc)
}

val df2 = df.withColumn("score", scoreCol)
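As a quick check, the chained expression can be exercised on the sample data from the question. This is a minimal sketch; it assumes a local SparkSession (the builder call and the name spark are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample data from the question
val df = Seq(
  (1, 0.3, 0.7, 0.45, 0.21),
  (4, 0.15, 0.66, 0.73, 0.47),
  (7, 0.34, 0.41, 0.78, 0.16),
  (3, 0.77, 0.1, 0.93, 0.67)
).toDF("sourceId", "score_1", "score_3", "score_4", "score_7")

val ids = Seq(1, 3, 4, 7)
val scoreCol = ids.foldLeft(lit(null)) { case (acc, id) =>
  when(col("sourceId") === id, col(s"score_$id")).otherwise(acc)
}

// Keeps only the two requested columns; each row's score comes
// from the score_* column named by its sourceId
df.withColumn("score", scoreCol)
  .select("sourceId", "score")
  .show()
```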

Or build a map expression from the score_* columns and use it to look up the score:

import org.apache.spark.sql.functions.{col, lit, map}

// The keys are the numeric suffixes of the score_* columns, kept as strings,
// so sourceId is cast to string for the lookup
val scoreMap = map(
  df.columns
    .filter(_.startsWith("score_"))
    .flatMap(c => Seq(lit(c.split("_")(1)), col(c))): _*
)

val df2 = df.withColumn("score", scoreMap(col("sourceId").cast("string")))
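For the question's four columns, the generated expression is equivalent to writing the map out by hand, which may make the lookup easier to see (a sketch only; note the keys are strings, so sourceId is cast to string before the lookup):

```scala
import org.apache.spark.sql.functions.{col, lit, map}

// Hand-written equivalent of the dynamically built map expression:
// key "1" maps to the score_1 column, key "3" to score_3, and so on
val scoreMap = map(
  lit("1"), col("score_1"),
  lit("3"), col("score_3"),
  lit("4"), col("score_4"),
  lit("7"), col("score_7")
)

// Applying a Column to another Column performs a per-row map lookup
val df2 = df.withColumn("score", scoreMap(col("sourceId").cast("string")))
```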

Another way of doing it is to build the when condition dynamically in a loop (this answer is in Java):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

List<String> scoreCols = Arrays.asList("score_1", "score_3", "score_4", "score_7");

// Prefix sourceId with "score_" and compare it against each column name,
// selecting that column's value when they match
Column actualScoreCol = when(
    concat(lit("score_"), col("sourceId")).equalTo(scoreCols.get(0)),
    col(scoreCols.get(0)));

for (int i = 1; i < scoreCols.size(); i++) {
  actualScoreCol = actualScoreCol.when(
      concat(lit("score_"), col("sourceId")).equalTo(scoreCols.get(i)),
      col(scoreCols.get(i)));
}

Dataset<Row> ds = df.withColumn("score", actualScoreCol);
