How to retrieve unique values in each window in pyspark dataframe
I have the following Spark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('').getOrCreate()
df = spark.createDataFrame([(1, "a", "2"), (2, "b", "2"),(3, "c", "2"), (4, "d", "2"),
(5, "b", "3"), (6, "b", "3"),(7, "c", "2")], ["nr", "column2", "quant"])
which returns:
+---+-------+------+
| nr|column2|quant |
+---+-------+------+
| 1| a| 2|
| 2| b| 2|
| 3| c| 2|
| 4| d| 2|
| 5| b| 3|
| 6| b| 3|
| 7| c| 2|
+---+-------+------+
I want to retrieve, within every group of 3 rows (that is, every window of size 3), the rows where the quant column has a unique value, as illustrated in the picture below.
Here the red marks each window, and within each window I keep only the green rows, the ones where quant is unique:
The output I would like to get is:
+---+-------+------+
| nr|column2| quant|
+---+-------+------+
| 1| a| 2|
| 4| d| 2|
| 5| b| 3|
| 7| c| 2|
+---+-------+------+
I'm new to Spark, so any help would be appreciated. Thanks.
Assuming that the grouping into sets of 3 records is based on the "nr" column, this approach should work for you.
Use a udf to decide whether a record should be selected, and lag to fetch the previous rows' quant values.
from pyspark.sql.functions import udf, lag, col
from pyspark.sql.types import BooleanType
from pyspark.sql.window import Window

def tag_selected(index, current_quant, prev_quant1, prev_quant2):
    if index % 3 == 1:  # first record in each group is always selected
        return True
    if index % 3 == 2 and current_quant != prev_quant1:  # second record is selected if its quant differs from the first
        return True
    if index % 3 == 0 and current_quant != prev_quant1 and current_quant != prev_quant2:  # third record is selected if its quant differs from both previous records
        return True
    return False

tag_selected_udf = udf(tag_selected, BooleanType())
df = spark.createDataFrame([(1, "a", "2"), (2, "b", "2"),(3, "c", "2"), (4, "d", "2"),
(5, "b", "3"), (6, "b", "3"),(7, "c", "2")], ["nr", "column2", "quant"])
window = Window.orderBy("nr")
df = df.withColumn("prev_quant1", lag(col("quant"), 1, None).over(window)) \
       .withColumn("prev_quant2", lag(col("quant"), 2, None).over(window)) \
       .withColumn("selected",
                   tag_selected_udf(col('nr'), col('quant'), col('prev_quant1'), col('prev_quant2'))) \
       .filter(col('selected') == True) \
       .drop("prev_quant1", "prev_quant2", "selected")
df.show()
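For clarity, this is what the two lag columns contain just before the filter (reconstructed from the logic above; note that lag looks across group boundaries, which is harmless here because the first record of each group is always selected regardless):
+---+-------+-----+-----------+-----------+--------+
| nr|column2|quant|prev_quant1|prev_quant2|selected|
+---+-------+-----+-----------+-----------+--------+
|  1|      a|    2|       null|       null|    true|
|  2|      b|    2|          2|       null|   false|
|  3|      c|    2|          2|          2|   false|
|  4|      d|    2|          2|          2|    true|
|  5|      b|    3|          2|          2|    true|
|  6|      b|    3|          3|          2|   false|
|  7|      c|    2|          3|          3|    true|
+---+-------+-----+-----------+-----------+--------+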
Result:
+---+-------+-----+
| nr|column2|quant|
+---+-------+-----+
| 1| a| 2|
| 4| d| 2|
| 5| b| 3|
| 7| c| 2|
+---+-------+-----+
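As a side note, the same "keep the first occurrence of each quant within each group of 3" logic can also be expressed without a Python UDF, which generally performs better because rows never leave the JVM. The sketch below makes the same assumption that "nr" is a gap-free sequence starting at 1; the group_id and rn column names are purely illustrative:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# df is assumed to be the original 7-row frame from the question.
# Map every 3 consecutive rows onto one group: nr 1-3 -> 0, 4-6 -> 1, ...
grouped = df.withColumn("group_id", F.floor((F.col("nr") - 1) / 3))

# Number the rows within each (group_id, quant) pair and keep only the first.
w = Window.partitionBy("group_id", "quant").orderBy("nr")
result = (grouped
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("group_id", "rn")
          .orderBy("nr"))
result.show()

One design note: the lag-based version uses Window.orderBy("nr") without partitionBy, which pulls all rows into a single partition. That is fine for small data but worth keeping in mind at scale, whereas the variant above partitions by group_id.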