
Get distinct row count in pyspark

I have the data frame below in PySpark. For every row, I want to check whether its id value is unique in the data frame.

Below is the dataframe:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [["1", "2020-02-01"], ["2", "2019-03-01"], ["3", "2021-03-01"],
        ["4", ""], ["5", "2021-21-01"], ["6", "1900-01-01"], ["6", "2000-01-01"]]
df = spark.createDataFrame(data, ["id", "input"])
df.show()

id input
1  2020-02-01
2  2019-03-01
3  2021-03-01
4
5  2021-21-01
6  1900-01-01
6  2000-01-01

I want to attach a row count to every row so I can tell whether it is unique. Below is the output I am looking for:

id CountUnique
1 1
2 1
3 1
4 1
5 1
6 2
6 2

The code below gives me the count per group; however, I need to show the count on every row. For example, id 6 should appear twice, each time with a count of 2.

df.groupBy("id").count().orderBy("id").show()

You can count with a window function, i.e. count(*) over (partition by id):

df.withColumn('count', F.expr('count(*) over (partition by id)')).show()

+---+----------+-----+
| id|     input|count|
+---+----------+-----+
|  3|2021-03-01|    1|
|  5|2021-21-01|    1|
|  6|1900-01-01|    2|
|  6|2000-01-01|    2|
|  1|2020-02-01|    1|
|  4|          |    1|
|  2|2019-03-01|    1|
+---+----------+-----+
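The same window can also be built with the DataFrame API instead of a SQL expression. A minimal equivalent sketch, naming the new column CountUnique to match the header in the question:

from pyspark.sql.window import Window

# Count the rows in each id partition and attach the result to every row.
w = Window.partitionBy("id")
df.withColumn("CountUnique", F.count("*").over(w)).show()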
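If you would rather avoid window functions, here is a sketch of an alternative under the same setup: compute the counts per id with groupBy, then join them back onto the original rows.

# Grouped counts, aliased to match the desired CountUnique column.
counts = df.groupBy("id").agg(F.count("*").alias("CountUnique"))

# A left join restores one output row per original row, duplicates included.
df.join(counts, on="id", how="left").orderBy("id").show()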
