How to drop duplicates but keep the first occurrence in a PySpark DataFrame?

Question

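I am trying to remove duplicates from a DataFrame, but the first entry of each duplicated row should not be removed. All the remaining duplicates, other than that first record, should be stored in a separate DataFrame.

For example, if the DataFrame looks like this: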

col1,col2,col3,col4
r,t,s,t
a,b,c,d
b,m,c,d
a,b,c,d
a,b,c,d
g,n,d,f
e,f,g,h
t,y,u,o
e,f,g,h
e,f,g,h

In this case, I should end up with two DataFrames.

df1:
r,t,s,t
a,b,c,d
b,m,c,d
g,n,d,f
e,f,g,h
t,y,u,o

and the other DataFrame should be:

a,b,c,d
a,b,c,d
e,f,g,h
e,f,g,h

Answer 1

Try using the window function row_number().

Example:

df.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#|   r|   t|   s|   t|
#|   a|   b|   c|   d|
#|   b|   m|   c|   d|
#|   a|   b|   c|   d|
#|   a|   b|   c|   d|
#|   g|   n|   d|   f|
#|   e|   f|   g|   h|
#|   t|   y|   u|   o|
#|   e|   f|   g|   h|
#|   e|   f|   g|   h|
#+----+----+----+----+

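from pyspark.sql import Window
from pyspark.sql.functions import col, lit, row_number

# Partition on every column so that identical rows land in the same group.
# row_number() needs an ordered window; the rows in a group are identical,
# so a constant ordering is enough.
w = Window.partitionBy("col1", "col2", "col3", "col4").orderBy(lit(1))

# rn == 1 marks the first row of each group: the copy to keep.
df1 = df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn")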

df1.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#|   b|   m|   c|   d|
#|   r|   t|   s|   t|
#|   g|   n|   d|   f|
#|   t|   y|   u|   o|
#|   a|   b|   c|   d|
#|   e|   f|   g|   h|
#+----+----+----+----+
# rn > 1 marks every later copy in a group: the duplicates to store separately.
df2 = df.withColumn("rn", row_number().over(w)).filter(col("rn") > 1).drop("rn")
df2.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#|   a|   b|   c|   d|
#|   a|   b|   c|   d|
#|   e|   f|   g|   h|
#|   e|   f|   g|   h|
#+----+----+----+----+
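
To make the logic of the window approach easier to follow, here is a minimal plain-Python sketch of the same split, without Spark. It is only an illustration under some assumptions: the helper name `split_first_and_rest` is hypothetical, the rows are held as tuples in a list, and a per-row counter stands in for `row_number()` over a partition on all columns. Unlike the Spark output above, this sketch keeps the input order.

```python
from collections import Counter

# Sample rows from the question, in their original order.
rows = [
    ("r", "t", "s", "t"),
    ("a", "b", "c", "d"),
    ("b", "m", "c", "d"),
    ("a", "b", "c", "d"),
    ("a", "b", "c", "d"),
    ("g", "n", "d", "f"),
    ("e", "f", "g", "h"),
    ("t", "y", "u", "o"),
    ("e", "f", "g", "h"),
    ("e", "f", "g", "h"),
]


def split_first_and_rest(rows):
    """Hypothetical helper: return (first occurrences, later duplicates)."""
    seen = Counter()
    firsts, rest = [], []
    for row in rows:
        seen[row] += 1  # plays the role of row_number() within the row's group
        if seen[row] == 1:
            firsts.append(row)  # corresponds to the rn == 1 filter
        else:
            rest.append(row)    # corresponds to the rn > 1 filter
    return firsts, rest


df1_rows, df2_rows = split_first_and_rest(rows)
```

Here `df1_rows` corresponds to `df1` and `df2_rows` to `df2` in the answer. In Spark the grouping is done by `partitionBy`, so the order of the rows shown by `show()` is not guaranteed.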

Answer 1
2 · accepted · 2020-08-10 16:54:56
