I want to dropDuplicates within each partition, not across the full DataFrame.
Is that possible with PySpark? Thanks.
import pyspark.sql.functions as f

# Tag each row with its partition ID so dropDuplicates treats
# identical rows in different partitions as distinct, then drop
# the helper column afterwards.
withNoDuplicates = (
    df.withColumn("partitionID", f.spark_partition_id())
      .dropDuplicates()
      .drop("partitionID")
)
Basically, you add a column containing the partition ID via spark_partition_id() and then call dropDuplicates(). Because the partition ID is now part of every row, rows are only considered duplicates when they come from the same partition, so each partition is deduplicated independently.
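To see why this works without a Spark cluster, here is a hedged plain-Python sketch of the same idea: partitions are modeled as lists of rows, each row is tagged with its partition index, and deduplication runs on the (partition_id, row) pair. The function name and the sample data are illustrative, not part of any Spark API.

```python
from itertools import chain

def drop_duplicates_per_partition(partitions):
    """Simulate the PySpark trick in plain Python: tag each row with its
    partition index, dedupe on the (partition_id, row) pair, then strip
    the tag. Duplicates are removed only within a partition; identical
    rows in different partitions survive."""
    tagged = chain.from_iterable(
        ((pid, row) for row in part) for pid, part in enumerate(partitions)
    )
    seen = set()
    out = []
    for pair in tagged:
        if pair not in seen:
            seen.add(pair)
            out.append(pair[1])  # keep the row, discard the tag
    return out

# Two partitions that both contain the row ("a", 1):
parts = [[("a", 1), ("a", 1), ("b", 2)], [("a", 1), ("c", 3)]]
print(drop_duplicates_per_partition(parts))
# → [('a', 1), ('b', 2), ('a', 1), ('c', 3)]
```

Note that ("a", 1) appears twice in the output, once per partition, while the duplicate inside the first partition is removed; this is exactly the behavior the spark_partition_id column buys you.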