Select a column based on another column's value in Spark Dataframe using Scala

I have a dataframe with 5 columns: sourceId, score_1, score_3, score_4 and score_7. The values of the sourceId column can be [1, 3, 4, 7]. I want to convert this into another dataframe that has the columns sourceId and score, where score depends on the value of the sourceId column.

sourceId  score_1  score_3  score_4  score_7
1         0.3      0.7      0.45     0.21
4         0.15     0.66     0.73     0.47
7         0.34     0.41     0.78     0.16
3         0.77     0.1      0.93     0.67

So if sourceId = 1, we select the value of score_1 for that record; if sourceId = 3, we select the value of score_3, and so on...

The result would be:

sourceId  score
1         0.3
4         0.73
7         0.16
3         0.1

What would be the best way to do this in Spark?
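
For reference, a minimal sketch that builds the example DataFrame above (a local SparkSession is assumed here; the data is taken from the table in the question):

import org.apache.spark.sql.SparkSession

// Hypothetical local session, just to reproduce the example
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 0.3,  0.7,  0.45, 0.21),
  (4, 0.15, 0.66, 0.73, 0.47),
  (7, 0.34, 0.41, 0.78, 0.16),
  (3, 0.77, 0.1,  0.93, 0.67)
).toDF("sourceId", "score_1", "score_3", "score_4", "score_7")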

Chaining multiple when expressions on the id column values:

import org.apache.spark.sql.functions.{col, lit, when}

val ids = Seq(1, 3, 4, 7)

// Fold the ids into a nested when/otherwise chain;
// lit(null) is the fallback if sourceId matches none of them.
val scoreCol = ids.foldLeft(lit(null)) { case (acc, id) =>
  when(col("sourceId") === id, col(s"score_$id")).otherwise(acc)
}

val df2 = df.withColumn("score", scoreCol)
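
With the sample data above, selecting the two columns should produce the expected result (a sketch; row order may vary):

df2.select("sourceId", "score").show()
// +--------+-----+
// |sourceId|score|
// +--------+-----+
// |       1|  0.3|
// |       4| 0.73|
// |       7| 0.16|
// |       3|  0.1|
// +--------+-----+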

Or build a map expression from the score_* columns and use it to look up the score values:

import org.apache.spark.sql.functions.{col, lit, map}

// Build a map expression keyed by the numeric suffix of each
// score_* column name: map("1" -> score_1, "3" -> score_3, ...)
val scoreMap = map(
  df.columns
    .filter(_.startsWith("score_"))
    .flatMap(c => Seq(lit(c.split("_")(1)), col(c))): _*
)

// The map keys are strings, so cast sourceId to string for the lookup
val df2 = df.withColumn("score", scoreMap(col("sourceId").cast("string")))
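
The generated expression is equivalent to map("1", score_1, "3", score_3, ...)[sourceId]. Unlike the when chain, a sourceId with no matching key simply yields null without needing an explicit otherwise clause; the keys come from the column-name suffixes, so they are strings, hence the cast on sourceId before the lookup.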

Another way of doing it is to create a dynamic when condition (this answer uses the Java API):

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

List<String> scoreCols = Arrays.asList("score_1", "score_3", "score_4", "score_7");

// Compare sourceId with the numeric suffix of each score_* column name
Column actualScoreCol = when(
    col("sourceId").equalTo(scoreCols.get(0).substring("score_".length())),
    col(scoreCols.get(0)).cast("string"));

for (int i = 1; i < scoreCols.size(); i++) {
  actualScoreCol = actualScoreCol.when(
      col("sourceId").equalTo(scoreCols.get(i).substring("score_".length())),
      col(scoreCols.get(i)).cast("string"));
}

Dataset<Row> df2 = df.withColumn("score", actualScoreCol);
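
As with the first approach, this builds one chained CASE WHEN expression; the cast("string") keeps every branch on a common type, which Spark requires and which matters if the score_* columns ever have different types.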
