
pyspark: drop duplicates with exclusive subset

I can use df1.dropDuplicates(subset=["col1","col2"]) to drop all rows that are duplicates with respect to the columns listed in the subset.

Is it possible to get the same result by specifying the columns to exclude from the subset instead (something like df1.dropDuplicates(subset=~["col3","col4"]))?

Thanks

# Build the subset from every column except the ones you want to exclude
df1.dropDuplicates(subset=[col for col in df1.columns if col not in ["col3", "col4"]])
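For reference, a minimal self-contained sketch of this exclusion-based approach; the DataFrame contents and column names here are illustrative, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data: col1/col2 identify a record, col3/col4 should be ignored when deduplicating
df1 = spark.createDataFrame(
    [(1, "a", 10, "x"), (1, "a", 20, "y"), (2, "b", 30, "z")],
    ["col1", "col2", "col3", "col4"],
)

# Subset = all columns minus the excluded ones
keep = [col for col in df1.columns if col not in ["col3", "col4"]]

# Keeps one row per distinct (col1, col2) combination
df1.dropDuplicates(subset=keep).show()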
