Get the last value of one column after a partitionBy
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
|ID_NOTIFICATION|ID_ENTITE|ID_ENTITE_GARANTE|CD_ETAT|DT_ETAT |CD_ANOMALIE|CD_TYPE_DESTINATAIRE|CD_TYPE_EVENEMENT |CD_SYS_APPELANT|TYP_MVT|DT_DEBUT |DT_FIN |
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
|3110305 |GNE |GNE |AT |2019-06-12 00:03:14|null |null |REL_CP_ULTIME_PAPIER|SIGMA |C |2019-06-12 00:03:22|2019-06-12 00:03:32|
|3110305 |GNE |GNE |AN |2019-06-12 00:03:28|017 |IDGRC |REL_CP_ULTIME_PAPIER|SIGMA |M |2019-06-12 00:03:22|2019-06-12 15:08:43|
|3110305 |GNE |GNE |AN |2019-06-12 00:03:28|017 |IDGRC |REL_CP_ULTIME_PAPIER|SIGMA |M |2019-06-12 00:03:22|2019-06-12 15:10:06|
|3110305 |GNE |GNE |AN |2019-06-12 15:10:02|017 |IDGRC |REL_CP_ULTIME_PAPIER|SIGMA |M |2019-06-12 00:03:22|2019-06-12 15:10:51|
|3110305 |GNE |GNE |AN |2019-06-12 15:10:02|017 |IDGRC |REL_CP_ULTIME_PAPIER|SIGMA |M |2019-06-12 00:03:22|2019-06-12 15:11:35|
I use partitionBy to keep only one row per distinct CD_ETAT:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"col" column syntax

// keep the earliest row (by DT_ETAT) of each CD_ETAT partition
val window = Window.partitionBy("CD_ETAT").orderBy("DT_ETAT")
df.withColumn("row_num", row_number().over(window))
  .filter($"row_num" === 1)
  .drop("row_num")
Output:
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
|ID_NOTIFICATION|ID_ENTITE|ID_ENTITE_GARANTE|CD_ETAT| DT_ETAT|CD_ANOMALIE|CD_TYPE_DESTINATAIRE| CD_TYPE_EVENEMENT|CD_SYS_APPELANT|TYP_MVT| DT_DEBUT| DT_FIN|
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
| 3110305| GNE| GNE| AT|2019-06-12 00:03:14| null| null|REL_CP_ULTIME_PAPIER| SIGMA| C|2019-06-12 00:03:22|2019-06-12 00:03:32|
| 3110305| GNE| GNE| AN|2019-06-12 00:03:28| 017| IDGRC|REL_CP_ULTIME_PAPIER| SIGMA| M|2019-06-12 00:03:22|2019-06-12 15:08:43|
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
My question: is there a way to modify this code so that, for each CD_ETAT, the kept row gets the DT_FIN of the latest record of that CD_ETAT instead of the first one?
Desired output:
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
|ID_NOTIFICATION|ID_ENTITE|ID_ENTITE_GARANTE|CD_ETAT| DT_ETAT|CD_ANOMALIE|CD_TYPE_DESTINATAIRE| CD_TYPE_EVENEMENT|CD_SYS_APPELANT|TYP_MVT| DT_DEBUT| DT_FIN|
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
| 3110305| GNE| GNE| AT|2019-06-12 00:03:14| null| null|REL_CP_ULTIME_PAPIER| SIGMA| C|2019-06-12 00:03:22|2019-06-12 00:03:32|
| 3110305| GNE| GNE| AN|2019-06-12 00:03:28| 017| IDGRC|REL_CP_ULTIME_PAPIER| SIGMA| M|2019-06-12 00:03:22|2019-06-12 15:11:35|
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
For this you need the following two window specifications:
val window = Window.partitionBy("CD_ETAT").orderBy("DT_ETAT")
val window1 = Window.partitionBy("CD_ETAT").orderBy($"DT_FIN".desc)

df.withColumn("DT_FIN", first($"DT_FIN").over(window1)) // latest DT_FIN of each CD_ETAT partition
  .withColumn("row_num", row_number().over(window))     // rank rows by DT_ETAT within each CD_ETAT
  .filter($"row_num" === 1)                             // keep the earliest row, now carrying the latest DT_FIN
  .drop("row_num")
  .show()
Output:
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
|ID_NOTIFICATION|ID_ENTITE|ID_ENTITE_GARANTE|CD_ETAT| DT_ETAT|CD_ANOMALIE|CD_TYPE_DESTINATAIRE| CD_TYPE_EVENEMENT|CD_SYS_APPELANT|TYP_MVT| DT_DEBUT| DT_FIN|
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
| 3110305| GNE| GNE| AT|2019-06-12 00:03:14| null| null|REL_CP_ULTIME_PAPIER| SIGMA| C|2019-06-12 00:03:22|2019-06-12 00:03:32|
| 3110305| GNE| GNE| AN|2019-06-12 00:03:28| 017| IDGRC|REL_CP_ULTIME_PAPIER| SIGMA| M|2019-06-12 00:03:22|2019-06-12 15:11:35|
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
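A slightly simpler variant (a sketch not in the original answer; the val names byEtat and byEtatAsc are illustrative) relies on the fact that a partition window without orderBy defaults to a frame spanning the whole partition, so a plain max can replace the second ordered window:

val byEtat    = Window.partitionBy("CD_ETAT")                    // no orderBy: frame is the whole partition
val byEtatAsc = Window.partitionBy("CD_ETAT").orderBy("DT_ETAT") // for picking the earliest row

df.withColumn("DT_FIN", max($"DT_FIN").over(byEtat))   // latest DT_FIN of the partition
  .withColumn("row_num", row_number().over(byEtatAsc)) // earliest row by DT_ETAT
  .filter($"row_num" === 1)
  .drop("row_num")
  .show()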
If you want the first row's values for every field except one, you can combine the last function with the window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"col" column syntax

val window = Window.partitionBy("CD_ETAT").orderBy("DT_ETAT")
// last() needs an explicit full-partition frame: with an orderBy, the default
// frame ends at the current row, so last() would just return the current row's DT_FIN.
df.withColumn("row_num", row_number().over(window))
  .withColumn("DT_FIN", last($"DT_FIN").over(window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
  .filter($"row_num" === 1)
  .drop("row_num")
This way no second window definition is needed, only an explicit frame for last on the same window.
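For reference, a minimal self-contained sketch (my own reduction, not part of the question: the column set is trimmed to the relevant ones and it assumes a local SparkSession) that reproduces the two-window approach end to end:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("last-dt-fin").getOrCreate()
import spark.implicits._

// reduced sample data: one "AT" row and four "AN" rows
val df = Seq(
  ("3110305", "AT", "2019-06-12 00:03:14", "2019-06-12 00:03:32"),
  ("3110305", "AN", "2019-06-12 00:03:28", "2019-06-12 15:08:43"),
  ("3110305", "AN", "2019-06-12 00:03:28", "2019-06-12 15:10:06"),
  ("3110305", "AN", "2019-06-12 15:10:02", "2019-06-12 15:10:51"),
  ("3110305", "AN", "2019-06-12 15:10:02", "2019-06-12 15:11:35")
).toDF("ID_NOTIFICATION", "CD_ETAT", "DT_ETAT", "DT_FIN")

val window  = Window.partitionBy("CD_ETAT").orderBy("DT_ETAT")
val window1 = Window.partitionBy("CD_ETAT").orderBy($"DT_FIN".desc)

df.withColumn("DT_FIN", first($"DT_FIN").over(window1)) // latest DT_FIN per CD_ETAT
  .withColumn("row_num", row_number().over(window))     // earliest row per CD_ETAT
  .filter($"row_num" === 1)
  .drop("row_num")
  .show(false)
// expected: one AT row with DT_FIN 00:03:32 and one AN row with DT_FIN 15:11:35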