How to get the max value of corresponding items in many arrays in a DataFrame column?
A DataFrame looks like this:
import spark.implicits._
val df1 = List(
  ("id1", Array(0, 2)),
  ("id1", Array(2, 1)),
  ("id2", Array(0, 3))
).toDF("id", "value")
+---+------+
| id| value|
+---+------+
|id1|[0, 2]|
|id1|[2, 1]|
|id2|[0, 3]|
+---+------+
I want to group by id and take the element-wise maximum over each group's value arrays. For id1 the maximum is Array(2, 2). The result I want is:
import spark.implicits._
val res = List(
  ("id1", Array(2, 2)),
  ("id2", Array(0, 3))
).toDF("id", "value")
+---+------+
| id| value|
+---+------+
|id1|[2, 2]|
|id2|[0, 3]|
+---+------+
import spark.implicits._

val df1 = List(
  ("id1", Array(0, 2, 3)),
  ("id1", Array(2, 1, 4)),
  ("id2", Array(0, 7, 3))
).toDF("id", "value")

// Key by id, then merge each pair of arrays position by position,
// keeping the larger element at each index.
val df2rdd = df1.rdd
  .map(x => (x(0).toString, x.getSeq[Int](1)))
  .reduceByKey { (x, y) =>
    val resarr = scala.collection.mutable.ArrayBuffer[Int]()
    var i = 0
    while (i < x.length) {
      if (x(i) >= y(i)) resarr.append(x(i)) else resarr.append(y(i))
      i += 1
    }
    resarr
  }
  .toDF("id", "newvalue")
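The merge inside `reduceByKey` is just an element-wise maximum of two equal-length sequences, which can also be written with `zip` and `map`. Here is a minimal, Spark-free sketch of that step (the name `elementWiseMax` is only for illustration):

```scala
// Element-wise maximum of two equal-length sequences:
// pair up the elements by position, then keep the larger of each pair.
def elementWiseMax(x: Seq[Int], y: Seq[Int]): Seq[Int] =
  x.zip(y).map { case (a, b) => math.max(a, b) }
```

Passing this function to `reduceByKey` would give the same result as the while loop above.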
You can do it like below:
//Input df
+---+---------+
| id| value|
+---+---------+
|id1|[0, 2, 3]|
|id1|[2, 1, 4]|
|id2|[0, 7, 3]|
+---+---------+
//Solution approach:
import org.apache.spark.sql.functions.{collect_set, udf}

// Collect each id's arrays into one column, then reduce them
// pairwise, keeping the larger element at each position.
val maxUDF = udf { (s: Seq[Seq[Int]]) =>
  s.reduceLeft((prev, next) =>
    prev.zip(next).map(tup => if (tup._1 > tup._2) tup._1 else tup._2))
}

val grouped = df.groupBy("id").agg(collect_set("value").as("value"))
grouped.withColumn("value", maxUDF(grouped.col("value"))).show
//Sample Output:
+---+---------+
| id| value|
+---+---------+
|id1|[2, 2, 4]|
|id2|[0, 7, 3]|
+---+---------+
I hope this helps.