Partition functions in Spark Scala
DF:
ID  col1 ... coln  Date
1   ...            1991-01-11 11:03:46.0
1   ...            1991-01-11 11:03:46.0
1   ...            1991-02-22 12:05:58.0
1   ...            1991-02-22 12:05:58.0
1   ...            1991-02-22 12:05:58.0
I am creating a new column "identify" to label each (ID, DATE) partition, so that I can then select the top combination in order of "identify".
Expected DF:
ID  col1 ... coln  Date                   identify
1   ...            1991-01-11 11:03:46.0  1
1   ...            1991-01-11 11:03:46.0  1
1   ...            1991-02-22 12:05:58.0  2
1   ...            1991-02-22 12:05:58.0  2
1   ...            1991-02-22 12:05:58.0  2
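For reference, here is a minimal sketch that reproduces the sample data (the extra col1...coln columns are omitted; the SparkSession setup and the string-typed Date column are assumptions, not part of the question — this string format sorts lexicographically in chronological order, so orderBy still behaves as expected):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, row_number}

val spark = SparkSession.builder().master("local[*]").appName("identify-example").getOrCreate()
import spark.implicits._

// Sample rows from the question. Spark resolves column names
// case-insensitively by default, so "DATE" in the snippets below matches "Date".
var df = Seq(
  (1, "1991-01-11 11:03:46.0"),
  (1, "1991-01-11 11:03:46.0"),
  (1, "1991-02-22 12:05:58.0"),
  (1, "1991-02-22 12:05:58.0"),
  (1, "1991-02-22 12:05:58.0")
).toDF("ID", "Date")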
Code attempt 1:
val window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))
My output:
ID  col1 ... coln  Date                   identify
1   ...            1991-01-11 11:03:46.0  1
1   ...            1991-01-11 11:03:46.0  2
1   ...            1991-02-22 12:05:58.0  3
1   ...            1991-02-22 12:05:58.0  4
1   ...            1991-02-22 12:05:58.0  5
Code attempt 2:
val window = Window.partitionBy("ID", "DATE").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))
My output:
ID  col1 ... coln  Date                   identify
1   ...            1991-01-11 11:03:46.0  1
1   ...            1991-01-11 11:03:46.0  2
1   ...            1991-02-22 12:05:58.0  1
1   ...            1991-02-22 12:05:58.0  2
1   ...            1991-02-22 12:05:58.0  3
Any suggestions on how to tweak the code to get the required output would be helpful.
The fix is to use dense_rank() instead of row_number(): row_number() numbers every row within the partition consecutively (attempt 1), and partitioning by ("ID", "DATE") resets that numbering per date but still counts duplicates (attempt 2), whereas dense_rank() gives tied DATE values the same rank and increments by one per distinct DATE:

val window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", dense_rank().over(window))
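Putting it together with the sample data above, a quick check of the dense_rank() version (output shown as comments; exact show() formatting may differ):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

val w = Window.partitionBy("ID").orderBy("Date")
df.withColumn("identify", dense_rank().over(w))
  .orderBy($"Date")
  .show(false)
// +---+---------------------+--------+
// |ID |Date                 |identify|
// +---+---------------------+--------+
// |1  |1991-01-11 11:03:46.0|1       |
// |1  |1991-01-11 11:03:46.0|1       |
// |1  |1991-02-22 12:05:58.0|2       |
// |1  |1991-02-22 12:05:58.0|2       |
// |1  |1991-02-22 12:05:58.0|2       |
// +---+---------------------+--------+

If the goal is then to keep only the top (ID, Date) combination, the new column can be filtered directly, e.g. df.filter($"identify" === 1).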