I have the below data frame in PySpark, and I want to check for every row whether its id is unique in the data frame. Below is the dataframe:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Create/get the SparkSession used below
spark = SparkSession.builder.getOrCreate()

data = [["1", "2020-02-01"], ["2", "2019-03-01"], ["3", "2021-03-01"], ["4", ""],
        ["5", "2021-21-01"], ["6", "1900-01-01"], ["6", "2000-01-01"]]
df = spark.createDataFrame(data, ["id", "input"])
df.show()
id | input |
---|---|
1 | 2020-02-01 |
2 | 2019-03-01 |
3 | 2021-03-01 |
4 | |
5 | 2021-21-01 |
6 | 1900-01-01 |
6 | 2000-01-01 |
I am looking to get, for each row, the count of rows sharing its id, so I can tell whether the row is unique. Below is the output I am looking for:
id | CountUnique |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 1 |
5 | 1 |
6 | 2 |
6 | 2 |
The code below gives me the count per id by grouping; however, I need to show the count on every row. For example, id 6 should appear twice, each time with a count of 2.
df.groupBy("id").count().orderBy("id").show()
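For reference, this grouped result collapses the two id 6 rows into a single row, which is why it is not what I want (output reconstructed from the data above):
+---+-----+
| id|count|
+---+-----+
|  1|    1|
|  2|    1|
|  3|    1|
|  4|    1|
|  5|    1|
|  6|    2|
+---+-----+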
You can count with a window function, i.e. count(*) over (partition by id):
df.withColumn('count', F.expr('count(*) over (partition by id)')).show()
+---+----------+-----+
| id| input|count|
+---+----------+-----+
| 3|2021-03-01| 1|
| 5|2021-21-01| 1|
| 6|1900-01-01| 2|
| 6|2000-01-01| 2|
| 1|2020-02-01| 1|
| 4| | 1|
| 2|2019-03-01| 1|
+---+----------+-----+
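Equivalently, you can express the same per-row count with the DataFrame Window API instead of a SQL expression. Aliasing the count to CountUnique and selecting only id reproduces the exact output asked for; this is a minimal sketch assuming df and F from the question's setup:

from pyspark.sql import Window

# Partition the window by id so count("*") counts all rows sharing each id
w = Window.partitionBy("id")

df.select("id", F.count("*").over(w).alias("CountUnique")).orderBy("id").show()

Both versions compile to the same window aggregation; the expr() form is just a SQL-syntax shortcut for it.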