
Spark create a new string column based on max value of row_number and string value of another column

Suppose we have

id, hit, item, row_number
1234, 1, item1, 1
1234, 2, item2, 2
2345, 2, item1, 1
2345, 2, item2, 2
2345, 4, item3, 3

where row_number was created by a window function partitioned by id and ordered by hit in ascending order.

Now, I would like to create a new column max_hit_item which contains the name of the item with the highest row_number per id.

So in our example, it would return:

id, hit, item, row_number, max_hit_item
1234, 1, item1, 1, item2
1234, 2, item2, 2, item2
2345, 2, item1, 1, item3
2345, 2, item2, 2, item3
2345, 4, item3, 3, item3
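To make the expected mapping concrete, here is the same logic sketched in plain Scala collections (no Spark); the case class fields mirror the columns above:

```scala
// One row of the example table.
case class Row(id: Int, hit: Int, item: String, rowNumber: Int)

val rows = Seq(
  Row(1234, 1, "item1", 1),
  Row(1234, 2, "item2", 2),
  Row(2345, 2, "item1", 1),
  Row(2345, 2, "item2", 2),
  Row(2345, 4, "item3", 3)
)

// Per id, pick the item carried by the highest row_number.
val maxHitItem: Map[Int, String] =
  rows.groupBy(_.id).map { case (id, rs) => id -> rs.maxBy(_.rowNumber).item }

// Attach it to every row, mirroring the desired max_hit_item column;
// no rows are dropped.
val withMax = rows.map(r => (r.id, r.hit, r.item, r.rowNumber, maxHitItem(r.id)))
```

This is only to pin down the semantics; the Spark answer below does the same per-partition computation with a window function.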

I'm thinking that since I do not want to drop any rows, I will have to use a window function. Is there a neat way of using window functions to achieve this? Ideally, I would like to avoid a join, but any solutions are welcome.

Use the window function `first`:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, first}

val w = Window.partitionBy("id").orderBy(desc("hit"))
val result = df.withColumn("max_hit_item", first("item").over(w))
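Why this works: with an `orderBy` on the window, `first("item")` returns the item of the leading row of each `id` partition in that ordering, i.e. the row with the highest `hit`. The equivalent semantics in plain Scala collections, assuming (as in the question) that `hit` order matches `row_number` order:

```scala
case class Rec(id: Int, hit: Int, item: String)

val recs = Seq(
  Rec(1234, 1, "item1"), Rec(1234, 2, "item2"),
  Rec(2345, 2, "item1"), Rec(2345, 2, "item2"), Rec(2345, 4, "item3")
)

// partitionBy("id").orderBy(desc("hit")) + first("item"):
// per id, sort by hit descending and take the leading item.
val firstItem: Map[Int, String] =
  recs.groupBy(_.id).map { case (id, rs) => id -> rs.sortBy(-_.hit).head.item }
```

One caveat: if two rows of the same `id` share the top `hit` value, `first` over `desc("hit")` is nondeterministic between them; ordering the window by `desc("row_number")` instead breaks such ties deterministically.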
