
pyspark: drop duplicates with exclusive subset

I can use df1.dropDuplicates(subset=["col1","col2"]) to drop all rows that are duplicates with respect to the columns listed in the subset.

Is it possible to get the same result by specifying the columns to exclude from the subset instead (something like df1.dropDuplicates(subset=~["col3","col4"]))?

Thanks

# Build the subset from every column except the ones you want to exclude
df1.dropDuplicates(subset=[col for col in df1.columns if col not in ["col3", "col4"]])
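For reference, a minimal self-contained sketch of this exclusion-based approach; the DataFrame contents and column names here are illustrative, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data: col1/col2 identify a record, col3/col4 should be ignored when deduplicating
df1 = spark.createDataFrame(
    [(1, "a", 10, "x"), (1, "a", 20, "y"), (2, "b", 30, "z")],
    ["col1", "col2", "col3", "col4"],
)

# Subset = all columns minus the excluded ones
keep = [col for col in df1.columns if col not in ["col3", "col4"]]

# Keeps one row per distinct (col1, col2) combination
df1.dropDuplicates(subset=keep).show()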
