Select a column based on another column's value in Spark Dataframe using Scala

I have a dataframe with 5 columns: sourceId, score_1, score_3, score_4 and score_7. The values of the sourceId column can be [1, 3, 4, 7]. I want to convert this into another dataframe that has the columns sourceId and score, where score depends on the value of the sourceId column.

sourceId  score_1  score_3  score_4  score_7
1         0.3      0.7      0.45     0.21
4         0.15     0.66     0.73     0.47
7         0.34     0.41     0.78     0.16
3         0.77     0.1      0.93     0.67

So if sourceId = 1, we select the value of score_1 for that record; if sourceId = 3, we select the value of score_3, and so on...

The result would be:

sourceId  score
1         0.3
4         0.73
7         0.16
3         0.1

What would be the best way to do this in Spark?
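
For reference, a minimal sketch that builds the example DataFrame above (a local SparkSession is assumed here; the data is taken from the table in the question):

import org.apache.spark.sql.SparkSession

// Hypothetical local session, just to reproduce the example
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 0.3,  0.7,  0.45, 0.21),
  (4, 0.15, 0.66, 0.73, 0.47),
  (7, 0.34, 0.41, 0.78, 0.16),
  (3, 0.77, 0.1,  0.93, 0.67)
).toDF("sourceId", "score_1", "score_3", "score_4", "score_7")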

Chaining multiple when expressions on the id column values:

import org.apache.spark.sql.functions.{col, lit, when}

val ids = Seq(1, 3, 4, 7)

// Fold the ids into a nested when/otherwise chain;
// lit(null) is the fallback if sourceId matches none of them.
val scoreCol = ids.foldLeft(lit(null)) { case (acc, id) =>
  when(col("sourceId") === id, col(s"score_$id")).otherwise(acc)
}

val df2 = df.withColumn("score", scoreCol)
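
With the sample data above, selecting the two columns should produce the expected result (a sketch; row order may vary):

df2.select("sourceId", "score").show()
// +--------+-----+
// |sourceId|score|
// +--------+-----+
// |       1|  0.3|
// |       4| 0.73|
// |       7| 0.16|
// |       3|  0.1|
// +--------+-----+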

Or build a map expression from the score_* columns and use it to look up the score values:

import org.apache.spark.sql.functions.{col, lit, map}

// Build a map expression keyed by the numeric suffix of each
// score_* column name: map("1" -> score_1, "3" -> score_3, ...)
val scoreMap = map(
  df.columns
    .filter(_.startsWith("score_"))
    .flatMap(c => Seq(lit(c.split("_")(1)), col(c))): _*
)

// The map keys are strings, so cast sourceId to string for the lookup
val df2 = df.withColumn("score", scoreMap(col("sourceId").cast("string")))
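
The generated expression is equivalent to map("1", score_1, "3", score_3, ...)[sourceId]. Unlike the when chain, a sourceId with no matching key simply yields null without needing an explicit otherwise clause; the keys come from the column-name suffixes, so they are strings, hence the cast on sourceId before the lookup.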

Another way of doing it is to create a dynamic when condition (this answer uses the Java API):

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

List<String> scoreCols = Arrays.asList("score_1", "score_3", "score_4", "score_7");

// Compare sourceId with the numeric suffix of each score_* column name
Column actualScoreCol = when(
    col("sourceId").equalTo(scoreCols.get(0).substring("score_".length())),
    col(scoreCols.get(0)).cast("string"));

for (int i = 1; i < scoreCols.size(); i++) {
  actualScoreCol = actualScoreCol.when(
      col("sourceId").equalTo(scoreCols.get(i).substring("score_".length())),
      col(scoreCols.get(i)).cast("string"));
}

Dataset<Row> df2 = df.withColumn("score", actualScoreCol);
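
As with the first approach, this builds one chained CASE WHEN expression; the cast("string") keeps every branch on a common type, which Spark requires and which matters if the score_* columns ever have different types.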
