How to add a new column based on an existing column in Spark Scala
Hello,
I've built a recommender using MLlib ALS in Apache Spark, with the following output:
user | product | rating
1 | 20 | 0.002
1 | 30 | 0.001
1 | 10 | 0.003
2 | 20 | 0.002
2 | 30 | 0.001
2 | 10 | 0.003
but I need to restructure the data, sorted by rating within each user, like this:
user | product | rating | number_rangking
1 | 10 | 0.003 | 1
1 | 20 | 0.002 | 2
1 | 30 | 0.001 | 3
2 | 10 | 0.003 | 1
2 | 20 | 0.002 | 2
2 | 30 | 0.001 | 3
How can I do that? Maybe someone can give me a clue...
Thanks
All you need is a window function; depending on the details, choose either rank or rowNumber:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
// The $"..." column syntax assumes the implicits are in scope,
// e.g. import spark.implicits._ with a SparkSession named spark.

val w = Window.partitionBy($"user").orderBy($"rating".desc)

df.select($"*", rank.over(w).alias("number_rangking")).show
// +----+-------+------+---------------+
// |user|product|rating|number_rangking|
// +----+-------+------+---------------+
// | 1| 10| 0.003| 1|
// | 1| 20| 0.002| 2|
// | 1| 30| 0.001| 3|
// | 2| 10| 0.003| 1|
// | 2| 20| 0.002| 2|
// | 2| 30| 0.001| 3|
// +----+-------+------+---------------+
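If you want consecutive numbers with no gaps when ratings tie, use row numbering instead of rank. A minimal sketch, assuming the same df and the window w defined above (in Spark 1.6+ the function is exposed as row_number in org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.row_number

// row_number assigns 1, 2, 3, ... within each user partition even for
// tied ratings, whereas rank gives tied rows the same number.
df.withColumn("number_rangking", row_number.over(w)).show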
Using a plain RDD you can groupByKey, process each group locally, and flatMap:
rdd
  // Convert to a PairRDD keyed by user
  .map { case (user, product, rating) => (user, (product, rating)) }
  .groupByKey
  .flatMap { case (user, vals) =>
    vals.toArray
      .sortBy(-_._2)  // Sort by rating, descending
      .zipWithIndex   // Add a zero-based index
      // Yield final values with a one-based rank
      .map { case ((product, rating), idx) => (user, product, rating, idx + 1) }
  }
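For a quick end-to-end check, a minimal sketch; the sample data and the SparkContext handle sc are assumptions for illustration, mirroring the question's input:

// Hypothetical sample data shaped like the question's ALS output.
val rdd = sc.parallelize(Seq(
  (1, 20, 0.002), (1, 30, 0.001), (1, 10, 0.003),
  (2, 20, 0.002), (2, 30, 0.001), (2, 10, 0.003)
))
// Running the pipeline above on this rdd and collecting yields
// (1,10,0.003,1), (1,20,0.002,2), (1,30,0.001,3), and likewise for user 2.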