Pyspark drop_duplicates(keep=False)

Question

I need a PySpark equivalent of pandas drop_duplicates(keep=False). Unfortunately, the keep=False option is not available in PySpark...

Pandas example:

import pandas as pd

df_data = {'A': ['foo', 'foo', 'bar'],
           'B': [3, 3, 5],
           'C': ['one', 'two', 'three']}
df = pd.DataFrame(data=df_data)
df = df.drop_duplicates(subset=['A', 'B'], keep=False)
print(df)

Expected output:

     A  B       C
2  bar  5  three

Converting to pandas with .toPandas() and back to PySpark is not an option.

Thanks!

Answer 1

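Use a window function to count the rows for each A / B combination, then filter the result to keep only the rows whose combination occurs exactly once: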

import pyspark.sql.functions as f

df.selectExpr(
  '*', 
  'count(*) over (partition by A, B) as cnt'
).filter(f.col('cnt') == 1).drop('cnt').show()

+---+---+-----+
|  A|  B|    C|
+---+---+-----+
|bar|  5|three|
+---+---+-----+

Alternatively, use a grouped-map pandas_udf:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# keep_unique returns the group unchanged if it has exactly one row;
# otherwise it returns an empty frame, which drops the whole group.
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def keep_unique(pdf):
    return pdf.iloc[:0] if len(pdf) > 1 else pdf

df.groupBy('A', 'B').apply(keep_unique).show()

+---+---+-----+
|  A|  B|    C|
+---+---+-----+
|bar|  5|three|
+---+---+-----+

Answer 2

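A simple approach is to count the rows in each A / B group, keep only the keys that occur exactly once, and then drop the helper column. Because groupBy keeps only the grouping columns, the surviving keys are joined back to the original DataFrame to recover the remaining columns such as C. Note that rows with a null in A or B never match in this join.

import pyspark.sql.functions as f
# Count rows per (A, B) and keep only the keys that occur exactly once.
keys = df.groupBy('A', 'B').agg(f.expr('count(*)').alias('Frequency'))
keys = keys.where(keys.Frequency == 1).drop('Frequency')
# Join back to the original rows to recover the non-key columns (C).
df = df.join(keys, on=['A', 'B'], how='inner')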

Answer 1: score 2, accepted, 2019-01-09 19:00:49

Answer 2: score 0, 2020-10-06 11:32:26