
How to add a new column based on an existing column in Spark Scala

Hi,

I've finished building a recommendation using MLlib ALS in Apache Spark, with this output:

user | product | rating
   1 |      20 |  0.002
   1 |      30 |  0.001
   1 |      10 |  0.003
   2 |      20 |  0.002
   2 |      30 |  0.001
   2 |      10 |  0.003

but I need to change the data structure so each user's rows are sorted by rating, like this:

user | product | rating | number_rangking
   1 |      10 |  0.003 |               1
   1 |      20 |  0.002 |               2
   1 |      30 |  0.001 |               3
   2 |      10 |  0.003 |               1
   2 |      20 |  0.002 |               2
   2 |      30 |  0.001 |               3

How can I do that? Maybe someone can give me a clue...

thx

All you need is a window function; depending on the details, you choose either rank or rowNumber.
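To make the snippet below self-contained, the sample data can first be loaded into a DataFrame. A minimal setup sketch, assuming a SparkSession named spark (the name df and the column names simply mirror the question):

import spark.implicits._

val df = Seq(
  (1, 20, 0.002), (1, 30, 0.001), (1, 10, 0.003),
  (2, 20, 0.002), (2, 30, 0.001), (2, 10, 0.003)
).toDF("user", "product", "rating")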

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

val w = Window.partitionBy($"user").orderBy($"rating".desc)

df.select($"*", rank.over(w).alias("number_rangking")).show
// +----+-------+------+---------------+
// |user|product|rating|number_rangking|
// +----+-------+------+---------------+
// |   1|     10| 0.003|              1|
// |   1|     20| 0.002|              2|
// |   1|     30| 0.001|              3|
// |   2|     10| 0.003|              1|
// |   2|     20| 0.002|              2|
// |   2|     30| 0.001|              3|
// +----+-------+------+---------------+
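Note that rank leaves gaps after ties (1, 1, 3, ...); if tied ratings should still receive distinct, consecutive numbers, row_number can be swapped in. A minimal variant of the same query (the function was named rowNumber before Spark 1.6):

import org.apache.spark.sql.functions.row_number

// Same window w as above, but with consecutive numbering even on ties
df.select($"*", row_number().over(w).alias("number_rangking")).show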

Using a plain RDD you can groupByKey, process each group locally, and flatMap:

rdd
  // Convert to a PairRDD keyed by user
  .map{case (user, product, rating) => (user, (product, rating))}
  .groupByKey
  .flatMap{case (user, vals) => vals.toArray
    .sortBy(-_._2) // Sort by rating, descending
    .zipWithIndex  // Add a 0-based index
    // Yield final values with a 1-based ranking
    .map{case ((product, rating), idx) => (user, product, rating, idx + 1)}}
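Here rdd is assumed to hold (user, product, rating) tuples. Since the scores come from MLlib ALS, an RDD[Rating] can be converted into that shape first; predictions below is an assumed name for the ALS output:

import org.apache.spark.mllib.recommendation.Rating

// Rating is MLlib's case class with user, product and rating fields
val rdd = predictions.map(r => (r.user, r.product, r.rating))

Also note that groupByKey collects each user's full recommendation list on a single executor, which is fine as long as per-user lists stay small.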
