Group by a column and create lists of another column's values in PySpark
I have a data frame as below:
dummy = pd.DataFrame([[1047, 2021, 0.38], [1056, 2021, 0.19]], columns=['reco', 'user', 'score'])
dummy
   reco  user  score
0  1047  2021   0.38
1  1056  2021   0.19
I want the output to look like this:
user  score         reco
2021  [0.38, 0.19]  [1047, 1056]
I want to group by user, with the lists built in descending order of score, and each reco value kept aligned with its score.
I tried collect_list, but the order changes. I want to keep the same order.
You can preserve ordering by applying collect_list over a window function. In this case the window is partitioned by user and ordered by score descending.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import Window as W

dummy = pd.DataFrame([[1047, 2021, 0.38], [1056, 2021, 0.19]], columns=['reco', 'user', 'score'])
df = spark.createDataFrame(dummy)

# Order rows within each user partition by descending score.
window_spec = W.partitionBy("user").orderBy(F.desc("score"))
# Extend the frame to the whole partition: with an orderBy, the default
# frame only runs up to the current row, so collect_list would otherwise
# return a growing prefix instead of the full list on every row.
ranged_spec = window_spec.rowsBetween(W.unboundedPreceding, W.unboundedFollowing)

df.withColumn("reco", F.collect_list("reco").over(ranged_spec)) \
  .withColumn("score", F.collect_list("score").over(ranged_spec)) \
  .withColumn("rn", F.row_number().over(window_spec)) \
  .where("rn == 1") \
  .drop("rn").show()
+------------+----+------------+
| reco|user| score|
+------------+----+------------+
|[1047, 1056]|2021|[0.38, 0.19]|
+------------+----+------------+
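As a side note, since the question's dummy frame starts out in pandas, the same ordered grouping can be sketched without Spark at all: sort first, then group, because pandas groupby preserves the row order within each group. This is a minimal sketch, not part of the original answer:

```python
import pandas as pd

dummy = pd.DataFrame([[1047, 2021, 0.38], [1056, 2021, 0.19]],
                     columns=['reco', 'user', 'score'])

# Sort by score descending before grouping; groupby keeps the
# within-group row order, so the collected lists come out sorted.
out = (dummy.sort_values('score', ascending=False)
            .groupby('user')
            .agg({'score': list, 'reco': list})
            .reset_index())
print(out)
#    user         score          reco
# 0  2021  [0.38, 0.19]  [1047, 1056]
```

This avoids a Spark session entirely, but only scales to data that fits in memory on one machine.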