
Partition functions in Spark Scala

DF:

ID  col1 ... coln ...  Date
1                      1991-01-11 11:03:46.0
1                      1991-01-11 11:03:46.0
1                      1991-02-22 12:05:58.0
1                      1991-02-22 12:05:58.0
1                      1991-02-22 12:05:58.0

I am creating a new column "identify" to mark each (ID, DATE) partition, so that I can then pick the top combination ordered by "identify".

Expected DF:

ID  col1 ... coln ...  Date                   identify
1                      1991-01-11 11:03:46.0  1
1                      1991-01-11 11:03:46.0  1
1                      1991-02-22 12:05:58.0  2
1                      1991-02-22 12:05:58.0  2
1                      1991-02-22 12:05:58.0  2

Code attempt 1:

var window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))

My output:

ID  col1 ... coln ...  Date                   identify
1                      1991-01-11 11:03:46.0  1
1                      1991-01-11 11:03:46.0  2
1                      1991-02-22 12:05:58.0  3
1                      1991-02-22 12:05:58.0  4
1                      1991-02-22 12:05:58.0  5

Code attempt 2:

 var window = Window.partitionBy("ID","DATE").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))

My output:

ID  col1 ... coln ...  Date                   identify
1                      1991-01-11 11:03:46.0  1
1                      1991-01-11 11:03:46.0  2
1                      1991-02-22 12:05:58.0  1
1                      1991-02-22 12:05:58.0  2
1                      1991-02-22 12:05:58.0  3

Any suggestion on how to adjust the code to get the desired output would be helpful.

var window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", dense_rank().over(window))
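The difference is in the ranking function: row_number numbers every row in the partition sequentially (which is why attempt 1 returns 1 through 5), while dense_rank assigns the same rank to all rows that share the same ordering value and increments by one for each distinct Date, which matches the expected "identify" column. Below is a minimal, self-contained sketch of the dense_rank approach using sample data that mirrors the question; the object name, local SparkSession setup, and string-typed Date column are assumptions for illustration only.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

// Hypothetical standalone example; object name and local master are illustrative.
object DenseRankIdentify {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dense-rank-identify")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows mirroring the question: one ID with two distinct Date values.
    val df = Seq(
      (1, "1991-01-11 11:03:46.0"),
      (1, "1991-01-11 11:03:46.0"),
      (1, "1991-02-22 12:05:58.0"),
      (1, "1991-02-22 12:05:58.0"),
      (1, "1991-02-22 12:05:58.0")
    ).toDF("ID", "Date")

    // dense_rank gives every row with the same Date (within an ID) the same
    // rank, and increases the rank by 1 for each new Date value.
    val window = Window.partitionBy("ID").orderBy("Date")
    val result = df.withColumn("identify", dense_rank().over(window))

    result.show(false)
    // ID  Date                   identify
    // 1   1991-01-11 11:03:46.0  1
    // 1   1991-01-11 11:03:46.0  1
    // 1   1991-02-22 12:05:58.0  2
    // 1   1991-02-22 12:05:58.0  2
    // 1   1991-02-22 12:05:58.0  2

    spark.stop()
  }
}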
