How to get unique key from Dataset Spark
I have a small dataset:
+-------------------+-------------+--------------+-------+-------------+
| session_id| insert_dttm| key| value| process_name|
+-------------------+-------------+--------------+-------+-------------+
|local-1641922005078|1641922023703|test_file1.csv|Success|ProcessResult|
|local-1641922005078|1641922023704|test_file1.csv|Success|ProcessResult|
|local-1641922005078|1641922023705|test_file2.csv|Success|ProcessResult|
|local-1641922005078|1641922023706|test_file2.csv|Success|ProcessResult|
|local-1641922005078|1641922023707|test_file3.csv|Success|ProcessResult|
|local-1641922005078|1641922023708|test_file3.csv|Success|ProcessResult|
+-------------------+-------------+--------------+-------+-------------+
I want to get a new dataset that contains only the most recent row for each unique key value.
Example output dataset:
+-------------------+-------------+--------------+-------+-------------+
| session_id| insert_dttm| key| value| process_name|
+-------------------+-------------+--------------+-------+-------------+
|local-1641922005078|1641922023704|test_file1.csv|Success|ProcessResult|
|local-1641922005078|1641922023706|test_file2.csv|Success|ProcessResult|
|local-1641922005078|1641922023708|test_file3.csv|Success|ProcessResult|
+-------------------+-------------+--------------+-------+-------------+
How can I get such a dataset with the Spark API, without using SQL?
You can use this Scala snippet to deduplicate the rows:
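import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val dataframe = (... your dataframe ...)
val rankColumn = "rank"
// Number the rows inside each group, newest insert_dttm first
val window = Window.partitionBy(col("session_id"), col("key"), col("value"), col("process_name")).orderBy(col("insert_dttm").desc)
// Keep the newest row of each group, then drop the helper column
val deduplicatedDf = dataframe.withColumn(rankColumn, row_number().over(window)).filter(col(rankColumn) === 1).drop(rankColumn)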
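To try the window approach end to end, a minimal self-contained sketch might look like the following. It assumes a local SparkSession and that "unique key" means the `key` column alone, so it partitions by `key` only rather than by the four columns used in the snippet. The object name `LatestPerKeyWindow` and the helpers `sampleData` and `latestPerKey` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical names throughout; only the Spark calls come from the library.
object LatestPerKeyWindow {

  // The six sample rows from the question.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("local-1641922005078", 1641922023703L, "test_file1.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023704L, "test_file1.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023705L, "test_file2.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023706L, "test_file2.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023707L, "test_file3.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023708L, "test_file3.csv", "Success", "ProcessResult")
    ).toDF("session_id", "insert_dttm", "key", "value", "process_name")
  }

  // Assumption: two rows are duplicates when they share the same `key`.
  def latestPerKey(df: DataFrame): DataFrame = {
    val newestFirst = Window.partitionBy(col("key")).orderBy(col("insert_dttm").desc)
    df.withColumn("rank", row_number().over(newestFirst))
      .filter(col("rank") === 1) // rank 1 is the newest row of each key
      .drop("rank")              // remove the helper column again
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; the app name is arbitrary.
    val spark = SparkSession.builder().master("local[*]").appName("latest-per-key-window").getOrCreate()
    latestPerKey(sampleData(spark)).orderBy("key").show(false)
    spark.stop()
  }
}
```

On the sample data this should leave the three rows with `insert_dttm` ending in 704, 706 and 708, which matches the expected output in the question.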
This might work:
import org.apache.spark.sql.functions.{col, max}

// Group by every column except the timestamp and keep the latest timestamp per group
my_df.groupBy(
  my_df.columns.filterNot(_ == "timestamp_column").map(col(_)): _*
).agg(max(my_df("timestamp_column")))
Just replace my_df and timestamp_column with the correct names.
This is essentially the same as doing this in SQL:
SELECT
    MAX(timestamp_column),
    <all the other columns>
FROM <your table>
GROUP BY
    <all the other columns>
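Applied to the dataset from the question, the group-by approach could be sketched as follows. This is only an illustration under a few assumptions: a local SparkSession, `insert_dttm` as the timestamp column, and hypothetical names (`LatestPerKeyGroupBy`, `sampleData`, `latestByGroup`). The alias and the final `select` are additions that keep the original column name and column order, since the aggregated column would otherwise be named `max(insert_dttm)` and move to the end.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max}

// Hypothetical names throughout; only the Spark calls come from the library.
object LatestPerKeyGroupBy {

  // The six sample rows from the question.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("local-1641922005078", 1641922023703L, "test_file1.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023704L, "test_file1.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023705L, "test_file2.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023706L, "test_file2.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023707L, "test_file3.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023708L, "test_file3.csv", "Success", "ProcessResult")
    ).toDF("session_id", "insert_dttm", "key", "value", "process_name")
  }

  // Groups by every column except `tsColumn` and keeps the largest timestamp per group.
  def latestByGroup(df: DataFrame, tsColumn: String): DataFrame = {
    val groupCols = df.columns.filterNot(_ == tsColumn).map(col)
    df.groupBy(groupCols: _*)
      .agg(max(col(tsColumn)).as(tsColumn)) // keep the original column name
      .select(df.columns.map(col): _*)      // restore the original column order
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; the app name is arbitrary.
    val spark = SparkSession.builder().master("local[*]").appName("latest-per-key-groupby").getOrCreate()
    latestByGroup(sampleData(spark), "insert_dttm").orderBy("key").show(false)
    spark.stop()
  }
}
```

Because every non-timestamp column takes part in the grouping, this yields one row per key only when those columns are identical across all rows of a key, as they are in the sample data.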
There is no need for a window function here, and it is best avoided when possible.
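One caveat: grouping by all the other columns returns one row per distinct combination of them, so if `value` differed between two rows of the same key, both rows would survive. A possible alternative that still avoids a window function is to aggregate a struct whose first field is the timestamp, since Spark compares structs field by field. The sketch below is not part of the original answer. The names `LatestPerKeyStruct` and `latestPerKey` are hypothetical, and the input rows are invented variations of the question's data in which `value` changes over time.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, struct}

// Hypothetical names throughout; only the Spark calls come from the library.
object LatestPerKeyStruct {

  // Invented rows (not from the question): `value` differs within a key.
  def variedData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("local-1641922005078", 1641922023703L, "test_file1.csv", "Failed", "ProcessResult"),
      ("local-1641922005078", 1641922023704L, "test_file1.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023705L, "test_file2.csv", "Success", "ProcessResult"),
      ("local-1641922005078", 1641922023706L, "test_file2.csv", "Failed", "ProcessResult")
    ).toDF("session_id", "insert_dttm", "key", "value", "process_name")
  }

  // Keeps, for each key, the whole row that carries the largest timestamp.
  def latestPerKey(df: DataFrame, keyColumn: String, tsColumn: String): DataFrame = {
    val others = df.columns.filterNot(c => c == keyColumn || c == tsColumn)
    // The timestamp goes first so that it dominates the struct comparison.
    val packedFields = tsColumn +: others
    df.groupBy(col(keyColumn))
      .agg(max(struct(packedFields.map(col): _*)).as("latest"))
      .select((col(keyColumn) +: packedFields.map(c => col(s"latest.$c"))): _*) // unpack the struct
      .select(df.columns.map(col): _*)                                          // restore column order
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; the app name is arbitrary.
    val spark = SparkSession.builder().master("local[*]").appName("latest-per-key-struct").getOrCreate()
    latestPerKey(variedData(spark), "key", "insert_dttm").orderBy("key").show(false)
    spark.stop()
  }
}
```

If two rows of a key share the same timestamp, the remaining struct fields break the tie, so exactly one row per key is still returned.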