
Get distinct row count in pyspark

I have the data frame below in PySpark. For every row, I want to check whether its id value is unique in the data frame.

Below is the dataframe:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [["1", "2020-02-01"], ["2", "2019-03-01"], ["3", "2021-03-01"],
        ["4", ""], ["5", "2021-21-01"], ["6", "1900-01-01"], ["6", "2000-01-01"]]
df = spark.createDataFrame(data, ["id", "input"])
df.show()

id input
1  2020-02-01
2  2019-03-01
3  2021-03-01
4
5  2021-21-01
6  1900-01-01
6  2000-01-01

I want to attach a row count to every row so I can tell whether it is unique. Below is the output I am looking for:

id CountUnique
1 1
2 1
3 1
4 1
5 1
6 2
6 2

The code below gives me the count per group; however, I need to show the count on every row. For example, id 6 should appear twice, each time with a count of 2.

df.groupBy("id").count().orderBy("id").show()

You can count with a window function, i.e. count(*) over (partition by id):

df.withColumn('count', F.expr('count(*) over (partition by id)')).show()

+---+----------+-----+
| id|     input|count|
+---+----------+-----+
|  3|2021-03-01|    1|
|  5|2021-21-01|    1|
|  6|1900-01-01|    2|
|  6|2000-01-01|    2|
|  1|2020-02-01|    1|
|  4|          |    1|
|  2|2019-03-01|    1|
+---+----------+-----+
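The same window can also be built with the DataFrame API instead of a SQL expression. A minimal equivalent sketch, naming the new column CountUnique to match the header in the question:

from pyspark.sql.window import Window

# Count the rows in each id partition and attach the result to every row.
w = Window.partitionBy("id")
df.withColumn("CountUnique", F.count("*").over(w)).show()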
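If you would rather avoid window functions, here is a sketch of an alternative under the same setup: compute the counts per id with groupBy, then join them back onto the original rows.

# Grouped counts, aliased to match the desired CountUnique column.
counts = df.groupBy("id").agg(F.count("*").alias("CountUnique"))

# A left join restores one output row per original row, duplicates included.
df.join(counts, on="id", how="left").orderBy("id").show()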
