
Spark create a new string column based on max value of row_number and string value of another column

Suppose we have

id, hit, item, row_number
1234, 1, item1, 1
1234, 2, item2, 2
2345, 2, item1, 1
2345, 2, item2, 2
2345, 4, item3, 3

where row_number was created by a window function partitioned by id and ordered by hit in ascending order.

Now, I would like to create a new column max_hit_item which contains the name of the item with the highest row_number per id.

So in our example, it would return:

id, hit, item, row_number, max_hit_item
1234, 1, item1, 1, item2
1234, 2, item2, 2, item2
2345, 2, item1, 1, item3
2345, 2, item2, 2, item3
2345, 4, item3, 3, item3
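To make the expected mapping concrete, here is the same logic sketched in plain Scala collections (no Spark); the case class fields mirror the columns above:

```scala
// One row of the example table.
case class Row(id: Int, hit: Int, item: String, rowNumber: Int)

val rows = Seq(
  Row(1234, 1, "item1", 1),
  Row(1234, 2, "item2", 2),
  Row(2345, 2, "item1", 1),
  Row(2345, 2, "item2", 2),
  Row(2345, 4, "item3", 3)
)

// Per id, pick the item carried by the highest row_number.
val maxHitItem: Map[Int, String] =
  rows.groupBy(_.id).map { case (id, rs) => id -> rs.maxBy(_.rowNumber).item }

// Attach it to every row, mirroring the desired max_hit_item column;
// no rows are dropped.
val withMax = rows.map(r => (r.id, r.hit, r.item, r.rowNumber, maxHitItem(r.id)))
```

This is only to pin down the semantics; the Spark answer below does the same per-partition computation with a window function.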

I'm thinking that since I do not want to drop any rows, I will have to use a window function. Is there a neat way of using window functions to achieve this? Ideally, I would like to avoid a join, but any solutions are welcome.

Use the window function `first`:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, first}

val w = Window.partitionBy("id").orderBy(desc("hit"))
val result = df.withColumn("max_hit_item", first("item").over(w))
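Why this works: with an `orderBy` on the window, `first("item")` returns the item of the leading row of each `id` partition in that ordering, i.e. the row with the highest `hit`. The equivalent semantics in plain Scala collections, assuming (as in the question) that `hit` order matches `row_number` order:

```scala
case class Rec(id: Int, hit: Int, item: String)

val recs = Seq(
  Rec(1234, 1, "item1"), Rec(1234, 2, "item2"),
  Rec(2345, 2, "item1"), Rec(2345, 2, "item2"), Rec(2345, 4, "item3")
)

// partitionBy("id").orderBy(desc("hit")) + first("item"):
// per id, sort by hit descending and take the leading item.
val firstItem: Map[Int, String] =
  recs.groupBy(_.id).map { case (id, rs) => id -> rs.sortBy(-_.hit).head.item }
```

One caveat: if two rows of the same `id` share the top `hit` value, `first` over `desc("hit")` is nondeterministic between them; ordering the window by `desc("row_number")` instead breaks such ties deterministically.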
